Merge pull request #1 from maxrousseau/gpt_simple
autoregressive LM and v0.1 release
maxrousseau authored Nov 9, 2024
2 parents 46eafac + fb1d58b commit 7759adb
Showing 25 changed files with 5,188 additions and 381 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -2,6 +2,7 @@
__pycache__/
*.py[cod]
*$py.class
wandb/

*.DS_Store

@@ -160,3 +161,6 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

# uv
uv.lock
193 changes: 151 additions & 42 deletions README.md
@@ -1,61 +1,170 @@
<div class="header" align="center">

# rafale

Rafale is (for now) a simple and opinionated transformer encoder training CLI.
<div class="logo">
<p align="center">
<img src="./media/rafale-logo.png" alt="rafale-logo" width="200px" />
<br>
Rafale is a simple and opinionated transformer training CLI.
</p>
</div>

## Dependencies
</div>

Attempting to balance ergonomics and simplicity. This is meant to be easily hackable for research purposes.
## 💡Purpose

```
torch, lightning-fabric (or) accelerate, datasets, rich (eyecandy) ~~tokenizers will be removed~~
```
Rafale provides opinionated scaffolding for training transformers. It is built solely to be an efficient
learning/research tool. It is **not** a fully-fledged library for large-scale training.

@TODO :: [check out this stream on HF accelerate](https://www.youtube.com/watch?v=X-Jx5-YskKY)
It should be thought of as a starting point for research projects to bootstrap experiments on small LMs. The best way to
use rafale is to simply fork it and build on top of it for your specific purposes.

### Core dependencies

## Purpose
Attempting to balance ergonomics and simplicity. This is meant to be easily hackable for research purposes.

This package is solely built to be an efficient research tool. It will not support data preprocessing/handling
pipelines. It should be thought of as a starting point for research projects to bootstrap experiments on small LMs.
```
torch, composer, datasets, tokenizers
```

It should be pip-installable via git and set up to be easily hackable so you can build on top of it.
## 🚀 Installation & Usage

Datasets should be preshuffled and pretokenized; we only load them from disk and feed them to the dataloader with the
collator function.
Setup with ```uv``` ([install uv](https://github.com/astral-sh/uv)).
```sh
$ git clone <repo url>
$ cd rafale
$ uv venv
$ . .venv/bin/activate
$ uv pip install -r cuda-requirements.txt  # or cpu-requirements.txt
$ uv pip install -e .
```

Launch a run with a configuration.

## Usage
```sh
$ python rafale/main -c test/pythia_tinystories.yaml
```

Mostly optimized for SLURM clusters.
What if I just want to prepare my dataset? ```DATA=1``` will run the data preparation and caching pipeline without
launching the training run.

```sh
$ DATA=1 python rafale/main -c test/pythia_tinystories.yaml
```

rafale run -c config.yaml # set DEBUG=1 for a sanity check
What if I want to test my model to make sure that it's learning? ```DEBUG=1``` will run 10 epochs on a single training
batch (the same batch is used for train/eval); the model should fit it quickly if there are no bugs in the implementation.

```sh
$ DEBUG=1 python rafale/main -c test/pythia_tinystories.yaml
```

## Roadmap

v0.1
- [ ] Local model weight loading
- [ ] load weights from safetensors and include it in the config (BERT and GPT2)
- [ ] integration with lighteval (?)
- [ ] Logging/Progress/Debug outputs with Rich library
- ~~RoBERTa BPE tokenizer with TikToken (compare w/ HF), on the fly tokenization to be handled by dataloader's
collator (for MLM)~~
- ~~model will be tied to the tokenizer, so the dataloader will be defined after the model and use its tokenizer~~
- We don't want anything to do with preprocessing; all data should be split/augmented/shuffled/tokenized/etc. beforehand. All we
  do with this tool is load it from disk, turn it into a tensor, and send it to the model
- [ ] Local dataloader
- [ ] ```debug``` single batch debug
- [ ] ```main.py``` handles both training and evaluation (together or separately)
- [-] BERT/RoBERTa support (MLM objective)
+ [ ] move the testing in the notebook to a debug file in the modeling folder
+ **layerwise decay** for fine-tuning (https://kozodoi.me/blog/20220329/discriminative-lr)
+ optimizations : flash attn2, xformers layer_norm (triton) or RMSNorm, xformers fused_linear_layer
+ RMSNorm
- [ ] simple trainer (see lightning-fabric simple trainer example and start from there)
+ bf16/fp16, gradient clipping, and gradient accumulation

v0.2
- DeepSpeed ZeRO
- Streaming dataloader

### 🔧 Under the hood

The goal of rafale is to provide a single entry point for data preparation and training. You configure the model and
dataset, then launch the training job.

When launching a run, we first run the datapipeline. If the dataset has already been processed (tokenized, padded,
chunked, etc.), it will be loaded from the cache (default location is ```~/.rafale_cache```).
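
As a rough sketch, the cache-or-prepare flow described above might look like this; the helper name and cache layout are illustrative assumptions, not rafale's actual implementation (only the ```~/.rafale_cache``` location comes from the docs).

```python
# Illustrative cache-or-prepare helper; names and layout are assumptions,
# not rafale's actual code.
import os
from datasets import load_from_disk

CACHE_DIR = os.path.expanduser("~/.rafale_cache")

def get_dataset(name: str, prepare_fn):
    """Return the processed dataset from the cache, building and caching it if needed."""
    path = os.path.join(CACHE_DIR, name)
    if os.path.isdir(path):
        return load_from_disk(path)   # already tokenized/chunked
    dataset = prepare_fn()            # tokenization, chunking, truncation, ...
    dataset.save_to_disk(path)
    return dataset
```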

> [!NOTE]
> #### Adding a new model
> To add a new model, you need to write a new configuration in ```rafale/models/configurations.py``` and add its key to
> ```model_config_dict``` in ```rafale/main.py```.
>
> Look at the ```ComposerLM``` wrapper class in ```rafale/models/decoder.py``` to check if all your building blocks are
> there. Otherwise you may need to modify/write a new wrapper.
>
> #### Adding a new datapipeline
>
> If the dataset is hosted on huggingface, simply use git lfs to clone the repo locally or use the repo name as the
> dataset path. Same goes for tokenizers since we use their tokenizer implementation.
>
> You will need to add a new datapipeline class in ```rafale/datapipes.py```, where the ```_prepare``` method performs all data
> preprocessing (tokenization, chunking, truncation, etc.) **EXCEPT** padding. Padding will be performed by the data collator. A minimal sketch follows below.
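
A rough illustration of such a pipeline is sketched below; the class name, constructor arguments, and dataset/tokenizer handling are assumptions made for the example, not the actual ```rafale/datapipes.py``` API.

```python
# Hypothetical datapipeline sketch: names and arguments are illustrative
# assumptions, not rafale's actual datapipes API.
from datasets import load_dataset
from tokenizers import Tokenizer

class TinyStoriesCLMDataPipe:
    def __init__(self, dataset_path: str, tokenizer_file: str, max_len: int = 512):
        self.dataset_path = dataset_path
        self.tokenizer = Tokenizer.from_file(tokenizer_file)
        self.max_len = max_len

    def _prepare(self):
        """Tokenize, chunk, and truncate: everything EXCEPT padding."""
        ds = load_dataset(self.dataset_path, split="train")

        def tokenize(batch):
            encodings = self.tokenizer.encode_batch(batch["text"])
            return {"input_ids": [enc.ids for enc in encodings]}

        ds = ds.map(tokenize, batched=True, remove_columns=ds.column_names)
        # Truncate to the block size; padding is left to the data collator.
        return ds.map(lambda ex: {"input_ids": ex["input_ids"][: self.max_len]})
```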
### 📕 Docs

Append the ```llm-docprompt.txt``` file to your favorite LLM's context and ask away.

### 🦾 Supported models


| Name        | Implemented | Inference test | Training test |
|:------------|:------------|:---------------|:--------------|
| BERT        | ✅          |                |               |
| RoBERTa     | ✅          |                |               |
| Pythia      | ✅          | ✅             | ✅            |
| CLIP/SigLIP |             |                |               |


## 🔮 Roadmap

<details>
<summary>v0.1</summary>


### v0.1 - initial release
- [x] single entrypoint CLI
- [ ] simple deploy/build
- [x] CPU macOS build: OK, ```uv run``` works with this
- [x] local Linux machine: for now, uv for the venv + requirements.txt
- [ ] SLURM (Compute Canada): TBD
- NOTE: because uv does not yet fully play well with PyTorch, a semi-manual setup is recommended
- [ ] load weights from safetensors and include it in the config (BERT/RoBERTa and Pythia)
- [x] pythia
- [ ] BERT/RoBERTa (need to move from HF to safetensors)
- [ ] MLM
- [ ] Classification
- [x] Pythia KV-cache implementation (a sketch of cached greedy decoding follows after this list)
- [x] greedy generation
- [ ] datapipes for CLM and MLM
- local dataloader for now
- [x] CLM tinystories
- [ ] MLM tinystories
- [ ] Imdb classification
- [x] ```main.py``` handles both training and evaluation (together or separately)
- [x] Mosaic Composer/Trainer
+ [x] fp16
+ [x] gradient clipping
+ [x] gradient accumulation (automatically handled by composer)
+ [x] building blocks are nn.Modules; specific models are ComposerModel classes with methods to load safetensors weights
  automatically (kept in a single separate file for each model)
+ [x] set DEBUG=1 for 1 batch sanity check before launching a run
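
The KV-cache and greedy generation items above follow the pattern sketched here; the model call signature (cache in, updated cache out) is an assumption for illustration, not the actual rafale Pythia interface.

```python
# Sketch of greedy generation with a KV cache: each step feeds only the newest
# token and reuses cached keys/values. The model call signature is an
# illustrative assumption.
import torch

@torch.no_grad()
def greedy_generate(model, input_ids, max_new_tokens: int, eos_id: int):
    tokens = input_ids                     # (1, prompt_len)
    step_input = input_ids
    kv_cache = None
    for _ in range(max_new_tokens):
        logits, kv_cache = model(step_input, kv_cache=kv_cache)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick
        tokens = torch.cat([tokens, next_token], dim=-1)
        if next_token.item() == eos_id:
            break
        step_input = next_token            # only the new token goes through the model
    return tokens
```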

Datapipelines
1. [x] tokenize
2. [x] concat and split w/ block size (pad w/ collator)
3. [x] save to disk {source}_{tokname}_bs{int}_len{int}
4. [x] data_collator: pad (if desired), shift labels right for next-token prediction, and return torch tensors (HF does this inside the model instead); see the sketch after this list
5. [x] test with model training
6. [ ] TinyStories, but for MLM as well
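
A minimal sketch of step 4 above; the function name, pad id, and the -100 ignore index are assumptions for illustration rather than rafale's actual collator.

```python
# Sketch of a CLM data collator: pad to the longest example in the batch,
# build next-token labels, and return torch tensors. Names, the pad id, and
# the -100 ignore index are illustrative assumptions.
import torch

def clm_collate(batch, pad_id: int = 0, ignore_index: int = -100):
    """batch: list of dicts, each with a variable-length "input_ids" list."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    input_ids, labels = [], []
    for ex in batch:
        ids = ex["input_ids"]
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # Labels hold the next token at each position; padding and the final
        # position (which has no next token) are masked out of the loss.
        labels.append(ids[1:] + [ignore_index] * (n_pad + 1))
    return {
        "input_ids": torch.tensor(input_ids, dtype=torch.long),
        "labels": torch.tensor(labels, dtype=torch.long),
    }
```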
</details>

<details>
<summary>v1.0</summary>

### path to v1.0
cleanup and additional features
- [ ] clean up ```tests``` for pythia and bert models on tinystories
- [ ] move the testing in the notebook to a debug file in the modeling folder
- [ ] optimizations : flash attn2, xformers layer_norm (triton) or RMSNorm, xformers fused_linear_layer
- [ ] try out schedulefree, SOAP, and other optimizers
- [ ] **layerwise decay** for fine-tuning (https://kozodoi.me/blog/20220329/discriminative-lr); a sketch follows below this list
- [ ] multimodality CLIP
- [ ] integration with lm-eval-harness ([guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage))

</details>
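
Layerwise (discriminative) learning-rate decay, as described in the linked post, can be implemented with per-layer parameter groups; the ```layers.{i}.``` naming scheme and the decay factor below are assumptions for illustration.

```python
# Sketch of layerwise LR decay: the last layer keeps the base LR and each
# earlier layer is scaled down geometrically. Naming scheme and factor are
# illustrative assumptions.
import torch

def layerwise_param_groups(model, n_layers: int, base_lr: float = 3e-4, decay: float = 0.9):
    groups = []
    for i in range(n_layers):
        params = [p for n, p in model.named_parameters() if f"layers.{i}." in n]
        if params:
            groups.append({"params": params, "lr": base_lr * decay ** (n_layers - 1 - i)})
    # Parameters outside the numbered layers (embeddings, final norm, head).
    rest = [p for n, p in model.named_parameters() if "layers." not in n]
    groups.append({"params": rest, "lr": base_lr})
    return groups

# optimizer = torch.optim.AdamW(layerwise_param_groups(model, n_layers=12))
```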

## I am GPU-rich, what do I use?

For large-scale experiments, other frameworks/libraries exist:
- lingua (Facebookresearch)
- torchtitan (Pytorch)
- torchtune (Pytorch)
- litGPT (LightningAI)
- GPT-NeoX (EleutherAI)
- nanotron (Huggingface)
- llm-foundry (MosaicML)
Empty file added cpu-requirements.txt
Empty file.
98 changes: 98 additions & 0 deletions cuda-requirements.txt
@@ -0,0 +1,98 @@
aiohappyeyeballs==2.4.3
aiohttp==3.10.10
aiosignal==1.3.1
anyio==4.6.2.post1
argcomplete==3.5.1
arrow==1.3.0
attrs==24.2.0
backoff==2.2.1
certifi==2024.8.30
charset-normalizer==3.4.0
click==8.1.7
composer==0.26.0
coolname==2.2.0
datasets==3.0.2
dill==0.3.8
docker-pycreds==0.4.0
filelock==3.16.1
frozenlist==1.5.0
fsspec==2024.9.0
gitdb==4.0.11
gitpython==3.1.43
gql==3.5.0
graphql-core==3.2.5
huggingface-hub==0.26.2
idna==3.10
importlib-metadata==8.5.0
jinja2==3.1.4
lightning-utilities==0.11.8
markdown-it-py==3.0.0
markupsafe==3.0.2
mdurl==0.1.2
mosaicml-cli==0.6.42
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
networkx==3.4.2
numpy==2.1.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.77
nvidia-nvtx-cu12==12.1.105
packaging==24.1
pandas==2.2.3
pillow==10.4.0
platformdirs==4.3.6
prompt-toolkit==3.0.48
propcache==0.2.0
protobuf==5.28.3
psutil==6.1.0
py-cpuinfo==9.0.0
pyarrow==18.0.0
pygments==2.18.0
python-dateutil==2.9.0.post0
pytorch-ranger==0.1.1
pytz==2024.2
pyyaml==6.0.2
questionary==1.10.0
-e file:///home/max/code/rafale
requests==2.32.3
rich==13.9.3
ruamel-yaml==0.18.6
ruamel-yaml-clib==0.2.12
safetensors==0.4.5
sentry-sdk==2.17.0
setproctitle==1.3.3
setuptools==75.3.0
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
sympy==1.13.3
tabulate==0.9.0
termcolor==2.5.0
tokenizers==0.20.1
torch==2.4.0
torch-optimizer==0.3.0
torchmetrics==1.4.0.post0
torchvision==0.19.0
tqdm==4.66.6
triton==3.0.0
types-python-dateutil==2.9.0.20241003
typing-extensions==4.12.2
tzdata==2024.2
urllib3==2.2.3
validators==0.34.0
wandb==0.18.5
wcwidth==0.2.13
websockets==11.0.3
xxhash==3.5.0
yarl==1.17.1
zipp==3.20.2
43 changes: 43 additions & 0 deletions generate_llm_docprompt.sh
@@ -0,0 +1,43 @@
#!/bin/bash

# Output file
OUTPUT_FILE="llm-docprompt.txt"

# Clear the output file if it exists
> "$OUTPUT_FILE"

# Add repo structure, excluding unwanted directories like .venv
echo "### Repository Structure ###" >> "$OUTPUT_FILE"
tree . -I 'wandb|*__pycache__|media|*-requirements.txt|.venv' >> "$OUTPUT_FILE"
echo -e "\n\n" >> "$OUTPUT_FILE"

# Include README.md content
if [[ -f "README.md" ]]; then
echo "### README.md ###" >> "$OUTPUT_FILE"
cat README.md >> "$OUTPUT_FILE"
echo -e "\n\n" >> "$OUTPUT_FILE"
fi

# Function to include content of a given file type, excluding .venv directory
include_files() {
local pattern="$1"
local header="$2"

find . -type f -name "$pattern" ! -path "./.venv/*" | while read -r file; do
echo "### $file ###" >> "$OUTPUT_FILE"
cat "$file" >> "$OUTPUT_FILE"
echo -e "\n\n" >> "$OUTPUT_FILE"
done
}

# Include Python files, excluding those in .venv
include_files "*.py" "Python File"

# Include YAML files only from the 'test' folder
find ./test -type f -name "*.yaml" | while read -r yaml_file; do
echo "### $yaml_file ###" >> "$OUTPUT_FILE"
cat "$yaml_file" >> "$OUTPUT_FILE"
echo -e "\n\n" >> "$OUTPUT_FILE"
done

echo "Documentation prompt has been generated in $OUTPUT_FILE"
