Ensure you have the following installed:
- Python 3.x: Make sure you have a recent version of Python installed.
- pip: Python package installer.
- PyTorch: install a build with Metal Performance Shaders (MPS) support for Apple Silicon.
- Clone the nanoGPT repository:

```bash
git clone https://github.com/karpathy/nanoGPT.git
cd nanoGPT
```
- Create a virtual environment and install dependencies:
```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
- For Apple Silicon (M1/M2), install PyTorch; the standard macOS wheels already include MPS support:

```bash
pip install torch torchvision torchaudio
```
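You can confirm that the install can see the Apple GPU with a quick check using the standard `torch.backends.mps` API:

```python
import torch

# True if the MPS (Apple GPU) backend can be used right now
print(torch.backends.mps.is_available())
# True if this PyTorch build was compiled with MPS support
print(torch.backends.mps.is_built())
```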
- Place your dataset (e.g., `data.txt`) in `data/`.
- Create a dataset directory:

```bash
mkdir data/my_dataset
```
- Use a preprocessing script to convert the dataset into a format suitable for training:

```bash
python data/shakespeare_char/prepare.py --input_file=data/data.txt --output_dir=data/my_dataset
```
- Update `prepare.py` for your specific tokenization needs; a character-level sketch is shown below.
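A minimal character-level version, in the spirit of `data/shakespeare_char/prepare.py` (the paths under `data/my_dataset/` are placeholders for your own files):

```python
import pickle
import numpy as np

# Read the raw text (path is a placeholder for your dataset)
with open('data/my_dataset/data.txt', 'r') as f:
    data = f.read()

# Build a character-level vocabulary
chars = sorted(set(data))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

# Encode the full text and split 90/10 into train/val
ids = np.array([stoi[c] for c in data], dtype=np.uint16)
split = int(0.9 * len(ids))
ids[:split].tofile('data/my_dataset/train.bin')
ids[split:].tofile('data/my_dataset/val.bin')

# Save vocabulary metadata so train.py and sample.py can decode the output
meta = {'vocab_size': len(chars), 'itos': itos, 'stoi': stoi}
with open('data/my_dataset/meta.pkl', 'wb') as f:
    pickle.dump(meta, f)
```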
- Edit configurations in `config/` or create a custom configuration file.
- Set hyperparameters based on your dataset size and available memory (e.g., batch size, learning rate); a small starting point is sketched below.
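For a small dataset on an M1, a conservative starting configuration might look like this (the file name and values are illustrative, not tuned):

```python
# config/train_my_dataset.py (hypothetical file name) -- conservative defaults for a small dataset
out_dir = 'out-my-dataset'
batch_size = 32        # lower this first if you hit out-of-memory errors
block_size = 128       # context length in tokens
n_layer = 4
n_head = 4
n_embd = 256
dropout = 0.1
learning_rate = 3e-4
max_iters = 5000
```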
- Start training with the MPS device for M1 GPUs:
```bash
python train.py --dataset=my_dataset --device=mps
```
- If out-of-memory errors occur, reduce the batch size or shrink the model; you can override these values from the command line, as shown below.
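nanoGPT's configurator lets you override config values as `--key=value` flags, so you can shrink the batch and context sizes without editing files (values here are illustrative):

```bash
python train.py --dataset=my_dataset --device=mps --batch_size=16 --block_size=128
```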
- Install and launch TensorBoard:

```bash
pip install tensorboard
tensorboard --logdir=out
```

- Access TensorBoard at http://localhost:6006/ to visualize training progress.
- Generate text with the trained model:
```bash
python sample.py --out_dir=out --start="Your initial text here"
```
- Tune parameters like temperature and maximum output length for varied results, as in the example below.
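For instance, sample count, output length, temperature, and top-k can all be passed as command-line flags (the values shown are illustrative):

```bash
python sample.py --out_dir=out --start="Your initial text here" --num_samples=3 --max_new_tokens=200 --temperature=0.8 --top_k=200
```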
- Start with a small dataset for initial runs.
- Experiment with different hyperparameters.
- If your dataset is small, pre-train on a larger corpus first to improve performance.
Deactivate the virtual environment when you are finished:
```bash
deactivate
```
Start by installing the dependencies. Open a terminal and run:
```bash
pip install torch numpy transformers datasets tiktoken wandb tqdm
```
These libraries are needed for training, tokenization, dataset handling, logging, and progress tracking.
- Clone the NanoGPT Repository:

```bash
git clone https://github.com/karpathy/nanoGPT.git
cd nanoGPT
```
- Prepare Your Dataset (in our case, "ufo"):
  - Organize your dataset in a text format. For example, if you're training on a custom text file, save it as `data/your_dataset/your_text.txt`, or in our case `data/ufo/UFOs_coord.csv` (for example, https://www.kaggle.com/datasets/utkarshx27/ufo-sights-2016-us-and-canada).
- Preprocess the Data:
  - Create a script similar to `data/shakespeare_char/prepare.py`, which tokenizes your text and converts it into binary files for training and validation (`train.bin` and `val.bin`).
  - Example preprocessing script (`data/your_dataset/prepare.py`):

```python
import os
import numpy as np
import pandas as pd
import tiktoken

# Load your dataset (csv or txt); in our case "data/ufo/UFOs_coord.csv"
df = pd.read_csv('data/ufo/UFOs_coord.csv', encoding='ISO-8859-1')
data = df['Summary'].str.cat(sep='\n')

# ...or, for a plain text file, something like the two lines below:
# with open('data/your_dataset/your_text.txt', 'r') as f:
#     data = f.read()

# Tokenize the text with the GPT-2 BPE tokenizer
encoder = tiktoken.get_encoding('gpt2')
tokens = encoder.encode(data)

# Split tokens into training and validation sets (90/10)
split_idx = int(len(tokens) * 0.9)
train_tokens = np.array(tokens[:split_idx], dtype=np.uint16)
val_tokens = np.array(tokens[split_idx:], dtype=np.uint16)

# Save binary files
os.makedirs('data/your_dataset', exist_ok=True)
train_tokens.tofile('data/your_dataset/train.bin')
val_tokens.tofile('data/your_dataset/val.bin')
```
- Run the Preprocessing Script:

```bash
python data/your_dataset/prepare.py
```
- Create a Training Configuration File:
  - Start with `config/train_shakespeare_char.py` as a template.
  - Adjust parameters based on your dataset and computational resources:

```python
# config/train_your_dataset.py
# in our case "out-ufo"; you can find our config in config/train_on_ufo.py
out_dir = 'out-your-dataset'
eval_interval = 500
eval_iters = 200
log_interval = 100

# Training settings
batch_size = 64
block_size = 256  # context size
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.1
learning_rate = 3e-4
max_iters = 10000
lr_decay_iters = 10000
min_lr = 6e-5
beta2 = 0.99
warmup_iters = 100

# Model loading/saving
init_from = 'scratch'  # initialize a new model
```
- Run the Training Script (at least 500 iterations):
  - If using an Apple Silicon Mac, use the MPS backend for better performance:

```bash
python train.py config/train_your_dataset.py --device=mps
```

  - If using a standard Mac with an Intel CPU:

```bash
python train.py config/train_your_dataset.py --device=cpu --compile=False
```
- Monitor Training:
  - Optionally, you can use Weights & Biases (`wandb`) for logging and monitoring:

```bash
wandb login
```

  - Then set `wandb_project = 'your_project_name'` in your config file, as in the snippet below.
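A minimal set of wandb-related settings, assuming the standard nanoGPT config variables `wandb_log`, `wandb_project`, and `wandb_run_name`:

```python
# Weights & Biases logging (disabled by default in nanoGPT)
wandb_log = True
wandb_project = 'your_project_name'
wandb_run_name = 'ufo-run-1'   # any label you like for this run
```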
If you want to finetune a pretrained model, update your configuration to initialize from a GPT-2 checkpoint (e.g., `init_from = 'gpt2'`) and adjust the learning rate to a smaller value (e.g., `1e-5`); a sketch follows.
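A minimal sketch of such a finetuning config, using the standard nanoGPT config keys (the file name, dataset name, and iteration counts are placeholders):

```python
# config/finetune_your_dataset.py (hypothetical file name)
out_dir = 'out-your-dataset-ft'
init_from = 'gpt2'        # load pretrained GPT-2 weights instead of training from scratch
dataset = 'your_dataset'  # data must have been prepared with the GPT-2 BPE tokenizer
learning_rate = 1e-5      # much smaller learning rate for finetuning
max_iters = 2000
lr_decay_iters = 2000
```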
- Run the Sampling Script:

```bash
python sample.py --out_dir=out-your-dataset --start="Your prompt here" --device=mps
```
- Adjust Sampling Parameters:
  - You can control the number of generated tokens, sampling temperature, etc., in the script; see the sketch below.
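These defaults near the top of `sample.py` are the usual knobs to edit (the values shown are illustrative, not recommendations):

```python
# Sampling knobs in sample.py (illustrative values)
num_samples = 3          # how many completions to generate
max_new_tokens = 300     # length of each generated sample, in tokens
temperature = 0.8        # < 1.0 = more conservative, > 1.0 = more adventurous
top_k = 200              # sample only from the top-k most likely tokens
```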
- Disable PyTorch 2.0 Compile Mode: If you encounter issues with PyTorch's compile mode, run training with `--compile=False`.
- Tune Hyperparameters:
  - Adjust the number of layers, heads, and embedding dimensions.
  - Experiment with different learning rates and dropout values.
Evaluate your model's performance on tasks like text generation, and fine-tune it based on results. Adjust hyperparameters as needed to improve quality.
This guide should help you train your own NanoGPT model on a Mac with ease!