
autoregressive LM and v0.1 release #1

Merged
merged 54 commits on Nov 9, 2024

Commits
b0ea14d
scaffolding, next implement GPT and make package more mature
maxrousseau Aug 16, 2024
e6c924c
WIP GPT pythia implementation, early stages
maxrousseau Aug 21, 2024
5df6f5a
iter
maxrousseau Aug 21, 2024
6037a05
iter
maxrousseau Aug 21, 2024
2cc26c9
iter
maxrousseau Aug 21, 2024
339df42
iter
maxrousseau Aug 21, 2024
a2fbeec
cleanup project a little
maxrousseau Aug 23, 2024
407006b
forward pass ok, next safetensor and inference eval
maxrousseau Sep 2, 2024
f70457b
pythia weight transfer OK, next TEST inference
maxrousseau Sep 2, 2024
c67eee4
pythia weight transfer OK, next TEST inference
maxrousseau Sep 2, 2024
91649df
rope debugging WIP
maxrousseau Sep 6, 2024
5beff99
fixed RoPE, now pythia tested and working, WIP cleanup
maxrousseau Sep 7, 2024
d0bd922
cleaned up for now, next is tinystories datapipe, training, then KV-c…
maxrousseau Sep 8, 2024
6fa46d6
iter, WIP tinystories datapipe
maxrousseau Sep 12, 2024
5d086c0
test tinystories datapipe
maxrousseau Sep 13, 2024
0f9f9c7
datapipe OK for tinystories, next step: train pythia on TS
maxrousseau Sep 13, 2024
046a680
training seems to be working, next step full tinystories finetune on GPU
maxrousseau Sep 14, 2024
57eb263
WIP kvcache
maxrousseau Sep 18, 2024
065c6d1
WIP kvcache
maxrousseau Sep 18, 2024
f0b9d8e
forgot to add the tests
maxrousseau Sep 26, 2024
2089ec0
moved media
maxrousseau Sep 26, 2024
3ac4b00
moved media
maxrousseau Sep 26, 2024
1dd7ae9
iter
maxrousseau Sep 26, 2024
f937038
WIP kv_cache
maxrousseau Sep 29, 2024
5ec5c08
WIP kv_cache again
maxrousseau Sep 29, 2024
ee0e0c8
kv cache working, next-step is to write test against HF implementation
maxrousseau Sep 30, 2024
8e2559c
found the bug I think, need to implement the UPPER RIGHT causal mask …
maxrousseau Oct 1, 2024
6026370
kvcache implementation is working! -fixed the causal mask
maxrousseau Oct 1, 2024
70a7531
cleanup up the test function a bit
maxrousseau Oct 2, 2024
7f5e190
scuffed out greedy decoding
maxrousseau Oct 5, 2024
e4003cb
added stop when repeat_ngram
maxrousseau Oct 5, 2024
58baf7f
iter, training runs work on test, next implement logging and evaluation
maxrousseau Oct 20, 2024
163b457
iter
maxrousseau Oct 23, 2024
87ddff3
getting closer to v0.1, logging and wandb is fine next DEBUG=1 and GP…
maxrousseau Oct 24, 2024
7c39029
readme deets
maxrousseau Oct 24, 2024
7cbb627
readme deets
maxrousseau Oct 24, 2024
ffeb09d
iter
maxrousseau Oct 29, 2024
3fd69da
some fuckery with uv and torch
maxrousseau Oct 29, 2024
388e877
iter
maxrousseau Oct 31, 2024
fa99cfe
GPU needs to be debugged, some tensors are allocated on CPU!
maxrousseau Oct 31, 2024
dd899be
fix neoxrope devices
maxrousseau Nov 3, 2024
2de297a
training on GPU is working
maxrousseau Nov 4, 2024
1624cd7
DEBUG=1 single batch implemented
maxrousseau Nov 6, 2024
6b796ca
iter
maxrousseau Nov 7, 2024
70cf05e
iter
maxrousseau Nov 7, 2024
8a481ed
tinystories full run
maxrousseau Nov 8, 2024
719bccd
iter readme
maxrousseau Nov 9, 2024
4553a41
removed data config, integrated into single config
maxrousseau Nov 9, 2024
81ba73b
readme improvements
maxrousseau Nov 9, 2024
988660d
readme improvements
maxrousseau Nov 9, 2024
91f9c27
readme improvements
maxrousseau Nov 9, 2024
3b37742
readme improvements
maxrousseau Nov 9, 2024
7552e5e
checkpointing OK, docprompt OK, ready for alpha
maxrousseau Nov 9, 2024
fb1d58b
readme iter
maxrousseau Nov 9, 2024
4 changes: 4 additions & 0 deletions .gitignore
@@ -2,6 +2,7 @@
__pycache__/
*.py[cod]
*$py.class
wandb/

*.DS_Store

@@ -160,3 +161,6 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

# uv
uv.lock
193 changes: 151 additions & 42 deletions README.md
@@ -1,61 +1,170 @@
<div class="header" align="center">

# rafale

Rafale is (for now) a simple and opinionated transformer encoder training CLI.
<div class="logo">
<p align="center">
<img src="./media/rafale-logo.png" alt="rafale-logo" width="200px" />
<br>
Rafale is a simple and opinionated transformer training CLI.
</p>
</div>

## Dependencies
</div>

Attempting to balance ergonomics and simplicity. This is meant to be easily hackable for research purposes.
## 💡 Purpose

```
torch, lightning-fabric (or) accelerate, datasets, rich (eyecandy) ~~tokenizers will be removed~~
```
Rafale provides opinionated scaffolding for training transformers. It is built solely to be an efficient
learning/research tool. It is **not** a fully-fledged library for large-scale training.

@TODO :: [check out this stream on HF accelerate](https://www.youtube.com/watch?v=X-Jx5-YskKY)
It should be thought of as a starting point for research projects to bootstrap experiments on small LMs. The best way to
use rafale is to simply fork it and build on top of it for your specific purposes.

### Core dependencies

## Purpose
Attempting to balance ergonomics and simplicity. This is meant to be easily hackable for research purposes.

This package is solely built to be an efficient research tool. It will not support data preprocessing/handling
pipelines. It should be thought of as a starting point for research projects to bootstrap experiments on small LMs.
```
torch, composer, datasets, tokenizers
```

Should be pip-installable via git and set up to be easily hackable, so you can build on top of it.
## 🚀 Installation & Usage

Datasets should be preshuffled and pretokenized; rafale only loads them from disk and feeds them to the dataloader with
the collator function.
Set up with ```uv``` ([install uv](https://github.com/astral-sh/uv)).
```sh
$ git clone <repo url>
$ cd rafale
$ uv venv
$ . .venv/bin/activate
$ uv pip install -r cuda-requirements.txt  # or cpu-requirements.txt
$ uv pip install -e .
```

Launch a run with a configuration.

## Usage
```sh
$ python rafale/main -c test/pythia_tinystories.yaml
```

Mostly optimized for SLURM clusters.
What if I just want to prepare my dataset? ```DATA=1``` will run the data preparation and caching pipeline without
launching the training run.

```sh
$ DATA=1 python rafale/main -c test/pythia_tinystories.yaml
```

rafale run -c config.yaml # set DEBUG=1 for a sanity check
What if I want to test my model to make sure that it's learning? ```DEBUG=1``` will run 10 epochs on a single training
batch (the same batch for train/eval); the model should fit it quickly if there are no bugs in the implementation.

```sh
$ DEBUG=1 python rafale/main -c test/pythia_tinystories.yaml
```
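Both modes read the same single YAML configuration file. As a rough orientation, here is a hypothetical sketch of what
such a config might contain; the field names are illustrative assumptions, not the actual schema (see
```test/pythia_tinystories.yaml``` for the real one).

```yaml
# Hypothetical run configuration; field names are assumptions,
# not rafale's actual schema. Consult test/pythia_tinystories.yaml.
run:
  name: pythia-tinystories
  seed: 42
model:
  name: pythia                  # key registered in model_config_dict
  weights: path/to/model.safetensors
data:
  name: tinystories
  block_size: 512
trainer:
  batch_size: 32
  lr: 3.0e-4
  max_duration: 1ep             # Composer-style duration string
```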

## Roadmap

v0.1
- [ ] Local model weight loading
- [ ] load weights from safetensors and include it in the config (BERT and GPT2)
- [ ] integration with lighteval (?)
- [ ] Logging/Progress/Debug outputs with Rich library
- ~~RoBERTa BPE tokenizer with TikToken (compare w/ HF), on the fly tokenization to be handled by dataloader's
collator (for MLM)~~
- ~~model will be tied to the tokenizer, so dataloader will be defined after the model and use it's tokenizer~~
- We don't want anything to do with preprocessing; all data should be split/augmented/shuffled/tokenized/etc.
beforehand. All we do with this tool is load it from disk, turn it into tensors, and send it to the model
- [ ] Local dataloader
- [ ] ```debug``` single batch debug
- [ ] ```main.py``` handles both training and evaluation (together or separately)
- [-] BERT/RoBERTa support (MLM objective)
+ [ ] move the testing in the notebook to a debug file in the modeling folder
+ **layerwise decay** for fine-tuning (https://kozodoi.me/blog/20220329/discriminative-lr)
+ optimizations : flash attn2, xformers layer_norm (triton) or RMSNorm, xformers fused_linear_layer
+ RMSNorm
- [ ] simple trainer (see lightning-fabric simple trainer example and start from there)
+ bf16/fp16, gradient clipping, and gradient accumulation

v0.2
- DeepSpeed ZeRO
- Streaming dataloader

### 🔧 Under the hood

The goal of rafale is to provide a single entry point for data preparation and training: you configure the model and
dataset, then launch the training job.

When launching a run, the data pipeline executes first. If the dataset has already been processed (tokenized, padded,
chunked, etc.), it is loaded from the cache (default location: ```~/.rafale_cache```).

> [!NOTE]
> #### Adding a new model
> To add a new model, you need to write a new configuration in ```rafale/models/configurations.py``` and add its key to
> ```model_config_dict``` in ```rafale/main.py```.
>
> Look at the ```ComposerLM``` wrapper class in ```rafale/models/decoder.py``` to check whether all your building blocks are
> there. Otherwise, you may need to modify it or write a new wrapper.
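>
> A hypothetical registration sketch (everything except ```model_config_dict``` and the file paths is an assumption):
>
> ```python
> # Hypothetical sketch: registering a new model configuration. Only
> # model_config_dict and the file locations come from the note above;
> # the config fields are illustrative assumptions.
>
> # rafale/models/configurations.py
> from dataclasses import dataclass
>
> @dataclass
> class MyModelConfig:
>     n_layers: int = 12
>     d_model: int = 768
>     n_heads: int = 12
>     vocab_size: int = 50304
>
> # rafale/main.py (hypothetical entry)
> # model_config_dict = {"pythia": PythiaConfig, "mymodel": MyModelConfig}
> ```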
>
> #### Adding a new datapipeline
>
> If the dataset is hosted on huggingface, simply use git lfs to clone the repo locally or use the repo name as the
> dataset path. Same goes for tokenizers since we use their tokenizer implementation.
>
> You will need to add a new datapipeline class in ```rafale/datapipes.py```, where the ```_prepare``` method performs all data
> preprocessing (tokenization, chunking, truncation, etc.) **EXCEPT** padding. Padding is performed by the data collator.
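
To make the datapipeline note concrete, here is a minimal hypothetical skeleton. Only ```_prepare```, its location in
```rafale/datapipes.py```, and the no-padding rule come from the note above; every other name is an assumption.

```python
# Hypothetical datapipeline skeleton for rafale/datapipes.py.
# Class name, constructor, and tokenizer calls are illustrative
# assumptions; only _prepare and the "no padding here" rule are
# documented above.
from datasets import load_dataset


class MyDatapipe:
    def __init__(self, tokenizer, block_size: int):
        self.tokenizer = tokenizer  # a HF tokenizers.Tokenizer
        self.block_size = block_size

    def _prepare(self, dataset_path: str):
        # All preprocessing EXCEPT padding: tokenize, chunk, truncate.
        ds = load_dataset(dataset_path)

        def tokenize(batch):
            encodings = self.tokenizer.encode_batch(batch["text"])
            return {"input_ids": [e.ids for e in encodings]}

        ds = ds.map(tokenize, batched=True, remove_columns=["text"])
        # Chunking/truncation to self.block_size would go here; padding
        # is deliberately left to the data collator at batch time.
        return ds
```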

### 📕 Docs

Feed ```llm-docprompt.txt``` to your favorite LLM and ask away.

### 🦾 Supported models


| Name | Implemented | Inference test | Training test |
|:------------|:------------|:---------------|:--------------|
| BERT | ✅ | | |
| RoBERTa | ✅ | | |
| Pythia | ✅ | ✅ | ✅ |
| CLIP/SigLIP | ⏳ | | |


## 🔮 Roadmap

<details>
<summary>v0.1</summary>


### v0.1 - initial release
- [x] single entrypoint CLI
- [ ] simple deploy/build
- [x] macOS CPU build - OK, ```uv run``` works with this
- [x] local linux machine - for now uv for venv + requirements.txt
- [ ] SLURM compute-canada - TBD
- NOTE: because uv still does not fully play well with PyTorch, we recommend a semi-manual setup
- [ ] load weights from safetensors and include it in the config (BERT/RoBERTa and Pythia)
- [x] pythia
- [ ] BERT/RoBERTa (need to move from HF to safetensors)
- [ ] MLM
- [ ] Classification
- [x] Pythia KV-cache implementation
- [x] greedy generation
- [ ] datapipes for CLM and MLM
- local dataloader for now
- [x] CLM tinystories
- [ ] MLM tinystories
- [ ] Imdb classification
- [x] ```main.py``` handles both training and evaluation (together or separately)
- [x] Mosaic Composer/Trainer
+ [x] fp16
+ [x] gradient clipping
+ [x] gradient accumulation (automatically handled by composer)
+ [x] building blocks are nn.Modules, specific models are ComposerModel classes with methods to load safetensor weights
automatically (keep in a single separate file for each model)
+ [x] set DEBUG=1 for 1 batch sanity check before launching a run

Datapipelines
1. [x] tokenize
2. [x] concat and split w/ block size (pad w/ collator)
3. [x] save to disk {source}_{tokname}_bs{int}_len{int}
4. [x] data_collator: pad (if desired), shift labels, and return torch tensors (HF does this inside the model; see the sketch below)
5. [x] test with model training
6. [ ] tinystories but for MLM as well
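
A minimal sketch of the collator from step 4 (a generic illustration, not rafale's actual implementation):

```python
# Generic CLM collator sketch: pad to the longest sequence in the
# batch and build labels shifted by one position, so the model
# predicts the next token. Illustrative only; rafale's actual
# collator may differ.
import torch


def clm_collate(batch, pad_id=0, ignore_index=-100):
    max_len = max(len(x["input_ids"]) for x in batch)
    input_ids, labels = [], []
    for x in batch:
        ids = x["input_ids"]
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # label at position t is the token at t + 1; pads are ignored
        labels.append(ids[1:] + [ignore_index] * (n_pad + 1))
    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
    }
```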
</details>

<details>
<summary>v1.0</summary>

### path to v1.0
cleanup and additional features
- [ ] clean up ```tests``` for pythia and bert models on tinystories
- [ ] move the testing in the notebook to a debug file in the modeling folder
- [ ] optimizations : flash attn2, xformers layer_norm (triton) or RMSNorm, xformers fused_linear_layer
- [ ] try out schedulefree, SOAP, and other optimizers
- [ ] **layerwise decay** for fine-tuning (https://kozodoi.me/blog/20220329/discriminative-lr)
- [ ] multimodality CLIP
- [ ] integration with lm-eval-harness ([guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage))

</details>

## I am GPU-rich, what do I use?

For large-scale experiments, other frameworks/libraries exist:
- lingua (facebookresearch)
- torchtitan (PyTorch)
- torchtune (PyTorch)
- litGPT (Lightning AI)
- GPT-NeoX (EleutherAI)
- nanotron (Hugging Face)
- llm-foundry (MosaicML)
Empty file added cpu-requirements.txt
Empty file.
98 changes: 98 additions & 0 deletions cuda-requirements.txt
@@ -0,0 +1,98 @@
aiohappyeyeballs==2.4.3
aiohttp==3.10.10
aiosignal==1.3.1
anyio==4.6.2.post1
argcomplete==3.5.1
arrow==1.3.0
attrs==24.2.0
backoff==2.2.1
certifi==2024.8.30
charset-normalizer==3.4.0
click==8.1.7
composer==0.26.0
coolname==2.2.0
datasets==3.0.2
dill==0.3.8
docker-pycreds==0.4.0
filelock==3.16.1
frozenlist==1.5.0
fsspec==2024.9.0
gitdb==4.0.11
gitpython==3.1.43
gql==3.5.0
graphql-core==3.2.5
huggingface-hub==0.26.2
idna==3.10
importlib-metadata==8.5.0
jinja2==3.1.4
lightning-utilities==0.11.8
markdown-it-py==3.0.0
markupsafe==3.0.2
mdurl==0.1.2
mosaicml-cli==0.6.42
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
networkx==3.4.2
numpy==2.1.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.77
nvidia-nvtx-cu12==12.1.105
packaging==24.1
pandas==2.2.3
pillow==10.4.0
platformdirs==4.3.6
prompt-toolkit==3.0.48
propcache==0.2.0
protobuf==5.28.3
psutil==6.1.0
py-cpuinfo==9.0.0
pyarrow==18.0.0
pygments==2.18.0
python-dateutil==2.9.0.post0
pytorch-ranger==0.1.1
pytz==2024.2
pyyaml==6.0.2
questionary==1.10.0
-e file:///home/max/code/rafale
requests==2.32.3
rich==13.9.3
ruamel-yaml==0.18.6
ruamel-yaml-clib==0.2.12
safetensors==0.4.5
sentry-sdk==2.17.0
setproctitle==1.3.3
setuptools==75.3.0
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
sympy==1.13.3
tabulate==0.9.0
termcolor==2.5.0
tokenizers==0.20.1
torch==2.4.0
torch-optimizer==0.3.0
torchmetrics==1.4.0.post0
torchvision==0.19.0
tqdm==4.66.6
triton==3.0.0
types-python-dateutil==2.9.0.20241003
typing-extensions==4.12.2
tzdata==2024.2
urllib3==2.2.3
validators==0.34.0
wandb==0.18.5
wcwidth==0.2.13
websockets==11.0.3
xxhash==3.5.0
yarl==1.17.1
zipp==3.20.2
43 changes: 43 additions & 0 deletions generate_llm_docprompt.sh
@@ -0,0 +1,43 @@
#!/bin/bash

# Output file
OUTPUT_FILE="llm-docprompt.txt"

# Clear the output file if it exists
> "$OUTPUT_FILE"

# Add repo structure, excluding unwanted directories like .venv
echo "### Repository Structure ###" >> "$OUTPUT_FILE"
tree . -I 'wandb|*__pycache__|media|*-requirements.txt|.venv' >> "$OUTPUT_FILE"
echo -e "\n\n" >> "$OUTPUT_FILE"

# Include README.md content
if [[ -f "README.md" ]]; then
echo "### README.md ###" >> "$OUTPUT_FILE"
cat README.md >> "$OUTPUT_FILE"
echo -e "\n\n" >> "$OUTPUT_FILE"
fi

# Function to include the content of files matching a pattern, excluding .venv
include_files() {
    local pattern="$1"

    find . -type f -name "$pattern" ! -path "./.venv/*" | while read -r file; do
        echo "### $file ###" >> "$OUTPUT_FILE"
        cat "$file" >> "$OUTPUT_FILE"
        echo -e "\n\n" >> "$OUTPUT_FILE"
    done
}

# Include Python files, excluding those in .venv
include_files "*.py"

# Include YAML files only from the 'test' folder
find ./test -type f -name "*.yaml" | while read -r yaml_file; do
echo "### $yaml_file ###" >> "$OUTPUT_FILE"
cat "$yaml_file" >> "$OUTPUT_FILE"
echo -e "\n\n" >> "$OUTPUT_FILE"
done

echo "Documentation prompt has been generated in $OUTPUT_FILE"