Merge pull request #1 from maxrousseau/gpt_simple
autoregressive LM and v0.1 release
maxrousseau authored Nov 9, 2024
2 parents 46eafac + fb1d58b commit 7759adb
Showing 25 changed files with 5,188 additions and 381 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -2,6 +2,7 @@
__pycache__/
*.py[cod]
*$py.class
wandb/

*.DS_Store

@@ -160,3 +161,6 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

# uv
uv.lock
193 changes: 151 additions & 42 deletions README.md
@@ -1,61 +1,170 @@
<div class="header" align="center">

# rafale

Rafale is (for now) a simple and opinionated transformer encoder training CLI.
<div class="logo">
<p align="center">
<img src="./media/rafale-logo.png" alt="rafale-logo" width="200px" />
<br>
Rafale is a simple and opinionated transformer training CLI.
</p>
</div>

## Dependencies
</div>

Attempting to balance ergonomics and simplicity. This is meant to be easily hackable for research purposes.
## 💡Purpose

```
torch, lightning-fabric (or) accelerate, datasets, rich (eyecandy) ~~tokenizers will be removed~~
```
Rafale provides opinionated scaffolding for training transformers. It is built solely to be an efficient
learning/research tool. It is **not** a fully-fledged library for large-scale training.

@TODO :: [check out this stream on HF accelerate](https://www.youtube.com/watch?v=X-Jx5-YskKY)
It should be thought of as a starting point for research projects to bootstrap experiments on small LMs. The best way to
use rafale is to simply fork it and build on top of it for your specific purposes.

### Core dependencies

## Purpose
Attempting to balance ergonomics and simplicity. This is meant to be easily hackable for research purposes.

This package is solely built to be an efficient research tool. It will not support data preprocessing/handling
pipelines. It should be thought of as a starting point for research projects to bootstrap experiments on small LMs.
```
torch, composer, datasets, tokenizers
```

It should be pip-installable via git and set up to be easily hackable so you can build on top of it.
## 🚀 Installation & Usage

Datasets should be preshuffled and pretokenized; we only load them from disk and feed them to the dataloader with the
collator function.
Setup with ```uv``` ([install uv](https://github.com/astral-sh/uv)).
```sh
$ git clone <repo url>
$ cd rafale
$ uv venv
$ . .venv/bin/activate
$ uv pip install -r cuda-requirements.txt  # or cpu-requirements.txt
$ uv pip install -e .
```

Launch a run with a configuration.

## Usage
```sh
$ python rafale/main -c test/pythia_tinystories.yaml
```

Mostly optimized for SLURM clusters.
What if I just want to prepare my dataset? ```DATA=1``` will run the data preparation and caching pipeline without
launching the training run.

```sh
$ DATA=1 python rafale/main -c test/pythia_tinystories.yaml
```

rafale run -c config.yaml # set DEBUG=1 for a sanity check
What if I want to test my model to make sure that it's learning? ```DEBUG=1``` will run 10 epochs on a single training
batch (the same batch is used for train/eval); the model should fit it quickly if there are no bugs in the implementation.

```sh
$ DEBUG=1 python rafale/main -c test/pythia_tinystories.yaml
```

## Roadmap

v0.1
- [ ] Local model weight loading
- [ ] load weights from safetensors and include it in the config (BERT and GPT2)
- [ ] integration with lighteval (?)
- [ ] Logging/Progress/Debug outputs with Rich library
- ~~RoBERTa BPE tokenizer with TikToken (compare w/ HF), on the fly tokenization to be handled by dataloader's
collator (for MLM)~~
- ~~model will be tied to the tokenizer, so the dataloader will be defined after the model and use its tokenizer~~
- We don't want anything to do with preprocessing; all data should be split/augmented/shuffled/tokenized/etc. beforehand. All we
  do with this tool is load it from disk, turn it into a tensor, and send it to the model
- [ ] Local dataloader
- [ ] ```debug``` single batch debug
- [ ] ```main.py``` handles both training and evaluation (together or separately)
- [-] BERT/RoBERTa support (MLM objective)
+ [ ] move the testing in the notebook to a debug file in the modeling folder
+ **layerwise decay** for fine-tuning (https://kozodoi.me/blog/20220329/discriminative-lr)
+ optimizations : flash attn2, xformers layer_norm (triton) or RMSNorm, xformers fused_linear_layer
+ RMSNorm
- [ ] simple trainer (see lightning-fabric simple trainer example and start from there)
+ bf16/fp16, gradient clipping, and gradient accumulation

v0.2
- DeepSpeed ZeRO
- Streaming dataloader

### 🔧 Under the hood

The goal of rafale is to provide a single entry point for data preparation and training. You configure the model and
dataset, then launch the training job.

When launching a run, we first run the datapipeline. If the dataset has already been processed (tokenized, padded,
chunked, etc.), it will be loaded from the cache (default location is ```~/.rafale_cache```).
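
As a rough sketch, the cache-or-prepare flow described above might look like this; the helper name and cache layout are illustrative assumptions, not rafale's actual implementation (only the ```~/.rafale_cache``` location comes from the docs).

```python
# Illustrative cache-or-prepare helper; names and layout are assumptions,
# not rafale's actual code.
import os
from datasets import load_from_disk

CACHE_DIR = os.path.expanduser("~/.rafale_cache")

def get_dataset(name: str, prepare_fn):
    """Return the processed dataset from the cache, building and caching it if needed."""
    path = os.path.join(CACHE_DIR, name)
    if os.path.isdir(path):
        return load_from_disk(path)   # already tokenized/chunked
    dataset = prepare_fn()            # tokenization, chunking, truncation, ...
    dataset.save_to_disk(path)
    return dataset
```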

> [!NOTE]
> #### Adding a new model
> To add a new model, you need to write a new configuration in ```rafale/models/configurations.py``` and add its key to
> ```model_config_dict``` in ```rafale/main.py```.
>
> Look at the ```ComposerLM``` wrapper class in ```rafale/models/decoder.py``` to check if all your building blocks are
> there. Otherwise you may need to modify/write a new wrapper.
>
> #### Adding a new datapipeline
>
> If the dataset is hosted on huggingface, simply use git lfs to clone the repo locally or use the repo name as the
> dataset path. Same goes for tokenizers since we use their tokenizer implementation.
>
> You will need to add a new datapipeline class in ```rafale/datapipes.py```, where the ```_prepare``` method performs all data
> preprocessing (tokenization, chunking, truncation, etc.) **EXCEPT** padding. Padding will be performed by the data collator. A minimal sketch follows below.
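
A rough illustration of such a pipeline is sketched below; the class name, constructor arguments, and dataset/tokenizer handling are assumptions made for the example, not the actual ```rafale/datapipes.py``` API.

```python
# Hypothetical datapipeline sketch: names and arguments are illustrative
# assumptions, not rafale's actual datapipes API.
from datasets import load_dataset
from tokenizers import Tokenizer

class TinyStoriesCLMDataPipe:
    def __init__(self, dataset_path: str, tokenizer_file: str, max_len: int = 512):
        self.dataset_path = dataset_path
        self.tokenizer = Tokenizer.from_file(tokenizer_file)
        self.max_len = max_len

    def _prepare(self):
        """Tokenize, chunk, and truncate: everything EXCEPT padding."""
        ds = load_dataset(self.dataset_path, split="train")

        def tokenize(batch):
            encodings = self.tokenizer.encode_batch(batch["text"])
            return {"input_ids": [enc.ids for enc in encodings]}

        ds = ds.map(tokenize, batched=True, remove_columns=ds.column_names)
        # Truncate to the block size; padding is left to the data collator.
        return ds.map(lambda ex: {"input_ids": ex["input_ids"][: self.max_len]})
```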
### 📕 Docs

Append the ```llm-docprompt.txt``` file to your favorite LLM's context and ask away.

### 🦾 Supported models


| Name        | Implemented | Inference test | Training test |
|:------------|:------------|:---------------|:--------------|
| BERT        | ✅          |                |               |
| RoBERTa     | ✅          |                |               |
| Pythia      | ✅          | ✅             | ✅            |
| CLIP/SigLIP |             |                |               |


## 🔮 Roadmap

<details>
<summary>v0.1</summary>


### v0.1 - initial release
- [x] single entrypoint CLI
- [ ] simple deploy/build
- [x] CPU macOS build: OK, ```uv run``` works with this
- [x] local Linux machine: for now, uv for the venv + requirements.txt
- [ ] SLURM (Compute Canada): TBD
- NOTE: because uv does not yet fully play well with PyTorch, a semi-manual setup is recommended
- [ ] load weights from safetensors and include it in the config (BERT/RoBERTa and Pythia)
- [x] pythia
- [ ] BERT/RoBERTa (need to move from HF to safetensors)
- [ ] MLM
- [ ] Classification
- [x] Pythia KV-cache implementation (a sketch of cached greedy decoding follows after this list)
- [x] greedy generation
- [ ] datapipes for CLM and MLM
- local dataloader for now
- [x] CLM tinystories
- [ ] MLM tinystories
- [ ] Imdb classification
- [x] ```main.py``` handles both training and evaluation (together or separately)
- [x] Mosaic Composer/Trainer
+ [x] fp16
+ [x] gradient clipping
+ [x] gradient accumulation (automatically handled by composer)
+ [x] building blocks are nn.Modules; specific models are ComposerModel classes with methods to load safetensors weights
  automatically (kept in a single separate file for each model)
+ [x] set DEBUG=1 for 1 batch sanity check before launching a run
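
The KV-cache and greedy generation items above follow the pattern sketched here; the model call signature (cache in, updated cache out) is an assumption for illustration, not the actual rafale Pythia interface.

```python
# Sketch of greedy generation with a KV cache: each step feeds only the newest
# token and reuses cached keys/values. The model call signature is an
# illustrative assumption.
import torch

@torch.no_grad()
def greedy_generate(model, input_ids, max_new_tokens: int, eos_id: int):
    tokens = input_ids                     # (1, prompt_len)
    step_input = input_ids
    kv_cache = None
    for _ in range(max_new_tokens):
        logits, kv_cache = model(step_input, kv_cache=kv_cache)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick
        tokens = torch.cat([tokens, next_token], dim=-1)
        if next_token.item() == eos_id:
            break
        step_input = next_token            # only the new token goes through the model
    return tokens
```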

Datapipelines
1. [x] tokenize
2. [x] concat and split w/ block size (pad w/ collator)
3. [x] save to disk {source}_{tokname}_bs{int}_len{int}
4. [x] data_collator: pad (if desired), shift labels right for next-token prediction, and return torch tensors (HF does this inside the model instead); see the sketch after this list
5. [x] test with model training
6. [ ] TinyStories, but for MLM as well
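
A minimal sketch of step 4 above; the function name, pad id, and the -100 ignore index are assumptions for illustration rather than rafale's actual collator.

```python
# Sketch of a CLM data collator: pad to the longest example in the batch,
# build next-token labels, and return torch tensors. Names, the pad id, and
# the -100 ignore index are illustrative assumptions.
import torch

def clm_collate(batch, pad_id: int = 0, ignore_index: int = -100):
    """batch: list of dicts, each with a variable-length "input_ids" list."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    input_ids, labels = [], []
    for ex in batch:
        ids = ex["input_ids"]
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # Labels hold the next token at each position; padding and the final
        # position (which has no next token) are masked out of the loss.
        labels.append(ids[1:] + [ignore_index] * (n_pad + 1))
    return {
        "input_ids": torch.tensor(input_ids, dtype=torch.long),
        "labels": torch.tensor(labels, dtype=torch.long),
    }
```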
</details>

<details>
<summary>v1.0</summary>

### path to v1.0
cleanup and additional features
- [ ] clean up ```tests``` for pythia and bert models on tinystories
- [ ] move the testing in the notebook to a debug file in the modeling folder
- [ ] optimizations : flash attn2, xformers layer_norm (triton) or RMSNorm, xformers fused_linear_layer
- [ ] try out schedulefree, SOAP, and other optimizers
- [ ] **layerwise decay** for fine-tuning (https://kozodoi.me/blog/20220329/discriminative-lr); a sketch follows below this list
- [ ] multimodality CLIP
- [ ] integration with lm-eval-harness ([guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage))

</details>
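
Layerwise (discriminative) learning-rate decay, as described in the linked post, can be implemented with per-layer parameter groups; the ```layers.{i}.``` naming scheme and the decay factor below are assumptions for illustration.

```python
# Sketch of layerwise LR decay: the last layer keeps the base LR and each
# earlier layer is scaled down geometrically. Naming scheme and factor are
# illustrative assumptions.
import torch

def layerwise_param_groups(model, n_layers: int, base_lr: float = 3e-4, decay: float = 0.9):
    groups = []
    for i in range(n_layers):
        params = [p for n, p in model.named_parameters() if f"layers.{i}." in n]
        if params:
            groups.append({"params": params, "lr": base_lr * decay ** (n_layers - 1 - i)})
    # Parameters outside the numbered layers (embeddings, final norm, head).
    rest = [p for n, p in model.named_parameters() if "layers." not in n]
    groups.append({"params": rest, "lr": base_lr})
    return groups

# optimizer = torch.optim.AdamW(layerwise_param_groups(model, n_layers=12))
```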

## I am GPU-rich, what do I use?

For large-scale experiments, other frameworks/libraries exist:
- lingua (Facebookresearch)
- torchtitan (Pytorch)
- torchtune (Pytorch)
- litGPT (LightningAI)
- GPT-NeoX (EleutherAI)
- nanotron (Huggingface)
- llm-foundry (MosaicML)
Empty file added cpu-requirements.txt
Empty file.
98 changes: 98 additions & 0 deletions cuda-requirements.txt
@@ -0,0 +1,98 @@
aiohappyeyeballs==2.4.3
aiohttp==3.10.10
aiosignal==1.3.1
anyio==4.6.2.post1
argcomplete==3.5.1
arrow==1.3.0
attrs==24.2.0
backoff==2.2.1
certifi==2024.8.30
charset-normalizer==3.4.0
click==8.1.7
composer==0.26.0
coolname==2.2.0
datasets==3.0.2
dill==0.3.8
docker-pycreds==0.4.0
filelock==3.16.1
frozenlist==1.5.0
fsspec==2024.9.0
gitdb==4.0.11
gitpython==3.1.43
gql==3.5.0
graphql-core==3.2.5
huggingface-hub==0.26.2
idna==3.10
importlib-metadata==8.5.0
jinja2==3.1.4
lightning-utilities==0.11.8
markdown-it-py==3.0.0
markupsafe==3.0.2
mdurl==0.1.2
mosaicml-cli==0.6.42
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
networkx==3.4.2
numpy==2.1.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.77
nvidia-nvtx-cu12==12.1.105
packaging==24.1
pandas==2.2.3
pillow==10.4.0
platformdirs==4.3.6
prompt-toolkit==3.0.48
propcache==0.2.0
protobuf==5.28.3
psutil==6.1.0
py-cpuinfo==9.0.0
pyarrow==18.0.0
pygments==2.18.0
python-dateutil==2.9.0.post0
pytorch-ranger==0.1.1
pytz==2024.2
pyyaml==6.0.2
questionary==1.10.0
-e file:///home/max/code/rafale
requests==2.32.3
rich==13.9.3
ruamel-yaml==0.18.6
ruamel-yaml-clib==0.2.12
safetensors==0.4.5
sentry-sdk==2.17.0
setproctitle==1.3.3
setuptools==75.3.0
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
sympy==1.13.3
tabulate==0.9.0
termcolor==2.5.0
tokenizers==0.20.1
torch==2.4.0
torch-optimizer==0.3.0
torchmetrics==1.4.0.post0
torchvision==0.19.0
tqdm==4.66.6
triton==3.0.0
types-python-dateutil==2.9.0.20241003
typing-extensions==4.12.2
tzdata==2024.2
urllib3==2.2.3
validators==0.34.0
wandb==0.18.5
wcwidth==0.2.13
websockets==11.0.3
xxhash==3.5.0
yarl==1.17.1
zipp==3.20.2
43 changes: 43 additions & 0 deletions generate_llm_docprompt.sh
@@ -0,0 +1,43 @@
#!/bin/bash

# Output file
OUTPUT_FILE="llm-docprompt.txt"

# Clear the output file if it exists
> "$OUTPUT_FILE"

# Add repo structure, excluding unwanted directories like .venv
echo "### Repository Structure ###" >> "$OUTPUT_FILE"
tree . -I 'wandb|*__pycache__|media|*-requirements.txt|.venv' >> "$OUTPUT_FILE"
echo -e "\n\n" >> "$OUTPUT_FILE"

# Include README.md content
if [[ -f "README.md" ]]; then
echo "### README.md ###" >> "$OUTPUT_FILE"
cat README.md >> "$OUTPUT_FILE"
echo -e "\n\n" >> "$OUTPUT_FILE"
fi

# Function to include content of a given file type, excluding .venv directory
include_files() {
local pattern="$1"
local header="$2"

find . -type f -name "$pattern" ! -path "./.venv/*" | while read -r file; do
echo "### $file ###" >> "$OUTPUT_FILE"
cat "$file" >> "$OUTPUT_FILE"
echo -e "\n\n" >> "$OUTPUT_FILE"
done
}

# Include Python files, excluding those in .venv
include_files "*.py" "Python File"

# Include YAML files only from the 'test' folder
find ./test -type f -name "*.yaml" | while read -r yaml_file; do
echo "### $yaml_file ###" >> "$OUTPUT_FILE"
cat "$yaml_file" >> "$OUTPUT_FILE"
echo -e "\n\n" >> "$OUTPUT_FILE"
done

echo "Documentation prompt has been generated in $OUTPUT_FILE"
