Merge pull request #1 from maxrousseau/gpt_simple
autoregressive LM and v0.1 release
Showing 25 changed files with 5,188 additions and 381 deletions.
README.md
@@ -1,61 +1,170 @@
<div class="header" align="center"> | ||
|
||
# rafale | ||
|
||
Rafale is (for now) a simple and opinionated transformer encoder training CLI. | ||
<div class="logo"> | ||
<p align="center"> | ||
<img src="./media/rafale-logo.png" alt="rafale-logo" width="200px" /> | ||
<br> | ||
Rafale is a simple and opinionated transformer training CLI. | ||
</p> | ||
</div> | ||
|
||
## Dependencies | ||
</div> | ||
|
||
Attempting to balance ergonomics and simplicity. This is meant to be easily hackable for research purposes. | ||
## 💡Purpose | ||
|
||
``` | ||
torch, lightning-fabric (or) accelerate, datasets, rich (eyecandy) ~~tokenizers will be removed~~ | ||
``` | ||
Rafale provides an opinionated scaffolding for training transformers. It is solely built to be an efficient | ||
learning/research tool. It is **not** a fully fledged library for large scale training. | ||
|
||
@TODO :: (check out this stream on HF accelerate)[https://www.youtube.com/watch?v=X-Jx5-YskKY] | ||
It should be thought of as a starting point for research projects to bootstrap experiments on small LMs. The best way to | ||
use rafale is to simply fork it and build on top of it for your specific purposes. | ||
|

### Core dependencies

Attempting to balance ergonomics and simplicity. This is meant to be easily hackable for research purposes.

```
torch, composer, datasets, tokenizers
```

## 🚀 Installation & Usage

Set up with ```uv``` ([install uv](https://github.com/astral-sh/uv)).

```sh
$ git clone <repo url>
$ cd rafale
$ uv venv
$ . .venv/bin/activate
$ uv pip install -r cuda-requirements.txt  # or cpu-requirements.txt
$ uv pip install -e .
```

Launch a run with a configuration.

```sh
$ python rafale/main -c test/pythia_tinystories.yaml
```

What if I just want to prepare my dataset? ```DATA=1``` will run the data preparation and caching pipeline without
launching the training run.

```sh
$ DATA=1 python rafale/main -c test/pythia_tinystories.yaml
```

What if I want to test my model to make sure that it's learning? ```DEBUG=1``` will run 10 epochs on a single training
batch (the same batch for train and eval); the model should fit it quickly if there are no bugs in the implementation.

```sh
$ DEBUG=1 python rafale/main -c test/pythia_tinystories.yaml
```

### 🔧 Under the hood

The goal of rafale is to provide a single entry point for data preparation and training. You configure the model and
dataset, then call the training job.

When calling a run, we first run the datapipeline. If the dataset has already been processed (tokenized, padded,
chunked, etc.), it will be loaded from the cache (the default location is ```~/.rafale_cache```).
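
Concretely, the flow might be sketched like this; ```prepare_data``` below is a hypothetical stand-in for the
datapipeline step, not rafale's actual API (the real logic lives in ```rafale/main.py``` and ```rafale/datapipes.py```):

```python
# Minimal sketch of the single-entry-point flow; helper names are hypothetical.
import os

CACHE_DIR = os.path.expanduser("~/.rafale_cache")

def prepare_data(name: str) -> str:
    """Return the cached dataset path, building and caching it on a miss."""
    path = os.path.join(CACHE_DIR, name)
    if not os.path.exists(path):
        os.makedirs(path)  # ...tokenize, chunk, and save to disk here
    return path

if __name__ == "__main__":
    data_path = prepare_data("pythia_tinystories")
    if os.environ.get("DATA") != "1":  # DATA=1 stops after data preparation
        print(f"launching training with {data_path}")  # ...build model, trainer.fit()
```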

> [!NOTE]
> #### Adding a new model
> To add a new model, you need to write a new configuration in ```rafale/models/configurations.py``` and add its key to
> ```model_config_dict``` in ```rafale/main.py``` (see the first sketch below).
>
> Look at the ```ComposerLM``` wrapper class in ```rafale/models/decoder.py``` to check whether all your building blocks are
> there. Otherwise, you may need to modify or write a new wrapper.
>
> #### Adding a new datapipeline
>
> If the dataset is hosted on HuggingFace, simply use git lfs to clone the repo locally or use the repo name as the
> dataset path. The same goes for tokenizers, since we use their tokenizer implementation.
>
> You will need to add a new datapipeline class in ```rafale/datapipes.py``` where the ```_prepare``` method performs all data
> preprocessing (tokenization, chunking, truncation, etc.) **EXCEPT** padding. Padding will be performed by the
> datacollator (see the second sketch below).
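
As a rough illustration of the first step, a configuration plus its registry entry might look like the sketch below;
the class name, fields, and values are hypothetical, so mirror the real entries in ```rafale/models/configurations.py```:

```python
# Hypothetical example of registering a new model configuration.
# Field names are illustrative, not rafale's actual schema.
from dataclasses import dataclass

@dataclass
class MyTinyLMConfig:
    vocab_size: int = 50304
    n_layers: int = 6
    n_heads: int = 8
    hidden_dim: int = 512
    max_seq_len: int = 1024

# In rafale/main.py, the key used in your YAML config maps to the class:
model_config_dict = {
    # ... existing models ...
    "my_tiny_lm": MyTinyLMConfig,
}
```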
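
A new datapipeline could take the following shape; the class layout is an assumption (and the tokenizer is assumed to
be an HF-style callable), so use the existing classes in ```rafale/datapipes.py``` as the real template:

```python
# Hypothetical shape of a datapipeline: _prepare does everything except padding.
from datasets import load_from_disk

class MyCLMDataPipe:  # stand-in for the base class in rafale/datapipes.py
    def __init__(self, dataset_path: str, tokenizer, block_size: int = 512):
        self.dataset_path = dataset_path
        self.tokenizer = tokenizer  # assumed HF-style callable tokenizer
        self.block_size = block_size

    def _prepare(self):
        """Tokenize, chunk, and truncate; padding is left to the datacollator."""
        ds = load_from_disk(self.dataset_path)
        return ds.map(
            lambda ex: self.tokenizer(
                ex["text"], truncation=True, max_length=self.block_size
            ),
            batched=True,
            remove_columns=["text"],
        )
```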

### 📕 Docs

Append the file ```llm-docprompt.txt``` to your favorite LLM and ask away.

### 🦾 Supported models

| Name        | Implemented | Inference test | Training test |
|:------------|:------------|:---------------|:--------------|
| BERT        | ✅          |                |               |
| RoBERTa     | ✅          |                |               |
| Pythia      | ✅          | ✅             | ✅            |
| CLIP/SigLIP | ⏳          |                |               |

## 🔮 Roadmap

<details>
<summary>v0.1</summary>

### v0.1 - initial release
- [x] single entrypoint CLI
- [ ] simple deploy/build
  + [x] CPU macOS build - OK, ```uv run``` works with this
  + [x] local Linux machine - for now, uv for the venv + requirements.txt
  + [ ] SLURM compute-canada - TBD
    - NOTE: because uv still does not fully play well with PyTorch, a semi-manual setup is recommended
- [ ] load weights from safetensors and include it in the config (BERT/RoBERTa and Pythia)
  + [x] Pythia
  + [ ] BERT/RoBERTa (need to move from HF to safetensors)
    - [ ] MLM
    - [ ] classification
- [x] Pythia KV-cache implementation
  + [x] greedy generation
- [ ] datapipes for CLM and MLM
  + local dataloader for now
  + [x] CLM tinystories
  + [ ] MLM tinystories
  + [ ] IMDB classification
- [x] ```main.py``` handles both training and evaluation (together or separately)
- [x] Mosaic Composer/Trainer
  + [x] fp16
  + [x] gradient clipping
  + [x] gradient accumulation (automatically handled by composer)
  + [x] building blocks are nn.Modules; specific models are ComposerModel classes with methods to load safetensor
    weights automatically (keep a single separate file for each model)
  + [x] set ```DEBUG=1``` for a 1-batch sanity check before launching a run

Datapipelines
1. [x] tokenize
2. [x] concat and split w/ block size (pad w/ collator)
3. [x] save to disk {source}_{tokname}_bs{int}_len{int}
4. [x] data_collator: *next* pad (if desired), shift labels right, and return torch tensors (see the collator sketch
   after this section) # HF does this in the model...
5. [x] test with model training
6. [ ] tiny stories but for MLM also

</details>
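
Step 4 of the datapipelines list might look like the minimal sketch below; the pad id, the ```-100``` ignore index, and
the function name are assumptions rather than rafale's actual datacollator, and the right-shift of labels is assumed to
happen inside the model's loss, HF-style:

```python
# Sketch of a CLM data collator: pad to the longest sequence in the batch and
# copy input_ids to labels, masking the padded positions with -100.
import torch

def collate_clm(batch, pad_id: int = 0):
    max_len = max(len(ex["input_ids"]) for ex in batch)
    input_ids, labels = [], []
    for ex in batch:
        ids = list(ex["input_ids"])
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)  # pad inputs with pad_id
        labels.append(ids + [-100] * n_pad)       # mask padding in the loss
    return {
        "input_ids": torch.tensor(input_ids, dtype=torch.long),
        "labels": torch.tensor(labels, dtype=torch.long),
    }
```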

<details>
<summary>v1.0</summary>

### path to v1.0
Cleanup and additional features:
- [ ] clean up ```tests``` for Pythia and BERT models on tinystories
- [ ] move the testing in the notebook to a debug file in the modeling folder
- [ ] optimizations: flash attention 2, xformers layer_norm (Triton) or RMSNorm, xformers fused_linear_layer
- [ ] try out schedule-free, SOAP, and other optimizers
- [ ] **layerwise decay** for fine-tuning ([blog post](https://kozodoi.me/blog/20220329/discriminative-lr)); see the sketch below
- [ ] multimodality: CLIP
- [ ] integration with lm-eval-harness ([guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage))

</details>
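
For the layerwise decay item, the usual recipe is to give parameter groups exponentially smaller learning rates the
further they sit from the head. A generic PyTorch sketch, not rafale code (names and values are illustrative):

```python
# Generic layer-wise learning-rate decay: earlier layers get base_lr * decay^k.
import torch

def layerwise_param_groups(layers, base_lr=3e-4, decay=0.9):
    """`layers` is ordered from embedding (first) to head (last)."""
    n = len(layers)
    return [
        {"params": layer.parameters(), "lr": base_lr * decay ** (n - 1 - i)}
        for i, layer in enumerate(layers)
    ]

# Example: a toy stack of linear layers standing in for transformer blocks.
blocks = [torch.nn.Linear(8, 8) for _ in range(4)]
optimizer = torch.optim.AdamW(layerwise_param_groups(blocks))
```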

## I am GPU-rich, what do I use?

For large-scale experiments, other frameworks/libraries exist:
- lingua (FacebookResearch)
- torchtitan (PyTorch)
- torchtune (PyTorch)
- litGPT (LightningAI)
- GPT-NeoX (EleutherAI)
- nanotron (HuggingFace)
- llm-foundry (MosaicML)
cuda-requirements.txt
@@ -0,0 +1,98 @@
aiohappyeyeballs==2.4.3
aiohttp==3.10.10
aiosignal==1.3.1
anyio==4.6.2.post1
argcomplete==3.5.1
arrow==1.3.0
attrs==24.2.0
backoff==2.2.1
certifi==2024.8.30
charset-normalizer==3.4.0
click==8.1.7
composer==0.26.0
coolname==2.2.0
datasets==3.0.2
dill==0.3.8
docker-pycreds==0.4.0
filelock==3.16.1
frozenlist==1.5.0
fsspec==2024.9.0
gitdb==4.0.11
gitpython==3.1.43
gql==3.5.0
graphql-core==3.2.5
huggingface-hub==0.26.2
idna==3.10
importlib-metadata==8.5.0
jinja2==3.1.4
lightning-utilities==0.11.8
markdown-it-py==3.0.0
markupsafe==3.0.2
mdurl==0.1.2
mosaicml-cli==0.6.42
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
networkx==3.4.2
numpy==2.1.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.77
nvidia-nvtx-cu12==12.1.105
packaging==24.1
pandas==2.2.3
pillow==10.4.0
platformdirs==4.3.6
prompt-toolkit==3.0.48
propcache==0.2.0
protobuf==5.28.3
psutil==6.1.0
py-cpuinfo==9.0.0
pyarrow==18.0.0
pygments==2.18.0
python-dateutil==2.9.0.post0
pytorch-ranger==0.1.1
pytz==2024.2
pyyaml==6.0.2
questionary==1.10.0
-e file:///home/max/code/rafale
requests==2.32.3
rich==13.9.3
ruamel-yaml==0.18.6
ruamel-yaml-clib==0.2.12
safetensors==0.4.5
sentry-sdk==2.17.0
setproctitle==1.3.3
setuptools==75.3.0
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
sympy==1.13.3
tabulate==0.9.0
termcolor==2.5.0
tokenizers==0.20.1
torch==2.4.0
torch-optimizer==0.3.0
torchmetrics==1.4.0.post0
torchvision==0.19.0
tqdm==4.66.6
triton==3.0.0
types-python-dateutil==2.9.0.20241003
typing-extensions==4.12.2
tzdata==2024.2
urllib3==2.2.3
validators==0.34.0
wandb==0.18.5
wcwidth==0.2.13
websockets==11.0.3
xxhash==3.5.0
yarl==1.17.1
zipp==3.20.2
Shell script (filename not shown) that generates ```llm-docprompt.txt```:
@@ -0,0 +1,43 @@
#!/bin/bash

# Output file
OUTPUT_FILE="llm-docprompt.txt"

# Clear the output file if it exists
> "$OUTPUT_FILE"

# Add the repo structure, excluding unwanted directories like .venv
echo "### Repository Structure ###" >> "$OUTPUT_FILE"
tree . -I 'wandb|*__pycache__|media|*-requirements.txt|.venv' >> "$OUTPUT_FILE"
echo -e "\n\n" >> "$OUTPUT_FILE"

# Include README.md content
if [[ -f "README.md" ]]; then
    echo "### README.md ###" >> "$OUTPUT_FILE"
    cat README.md >> "$OUTPUT_FILE"
    echo -e "\n\n" >> "$OUTPUT_FILE"
fi

# Include every file matching a glob pattern, excluding the .venv directory
include_files() {
    local pattern="$1"

    find . -type f -name "$pattern" ! -path "./.venv/*" | while read -r file; do
        echo "### $file ###" >> "$OUTPUT_FILE"
        cat "$file" >> "$OUTPUT_FILE"
        echo -e "\n\n" >> "$OUTPUT_FILE"
    done
}

# Include Python files, excluding those in .venv
include_files "*.py"

# Include YAML files only from the 'test' folder
find ./test -type f -name "*.yaml" | while read -r yaml_file; do
    echo "### $yaml_file ###" >> "$OUTPUT_FILE"
    cat "$yaml_file" >> "$OUTPUT_FILE"
    echo -e "\n\n" >> "$OUTPUT_FILE"
done

echo "Documentation prompt has been generated in $OUTPUT_FILE"