fixed alpaca dataset evalset length and make sure len(eval_loader)>0 #540

Merged: 1 commit merged into main on May 29, 2024

Conversation

@wukaixingxp (Contributor) commented on May 23, 2024

What does this PR do?

Fixed the alpaca dataset eval-set length by using 5% of the dataset as the eval set, so that len(eval_dataloader) > 0, meaning at least one batch can be loaded by the dataloader. Also added a check that len(eval_dataloader) > 0 when run_validation=True; otherwise an error is raised and training stops.

This problem was reported in issue #520.
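
For reference, here is a minimal sketch of the dataset-side change, assuming the split lives in the alpaca dataset loader; the function and variable names below are illustrative, and only the 5% fraction comes from the PR description:

```python
# Sketch: hold out 5% of the alpaca annotations for evaluation instead of a
# small fixed-size slice, so the eval dataloader can always form at least
# one batch. Names here are illustrative, not the actual diff.
def split_annotations(ann: list, partition: str) -> list:
    eval_length = int(len(ann) / 20)  # 5% of the dataset
    if partition == "train":
        return ann[eval_length:]
    return ann[:eval_length]
```

With the 52,002-example alpaca dataset this yields the 2,600 validation examples seen in the first test log below.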

Feature/Issue validation/testing

Please describe the tests that you ran to verify your changes and summarize the relevant results. Provide instructions so the tests can be reproduced. Please also list any relevant details of your test configuration.

  • Evaluation step works in the alpaca fine-tuning run (see the log below):
```
~/work/llama-recipes (fix/eval_dataloader_not_loaded)]$ torchrun --rdzv-endpoint=localhost:0 --rdzv-id=111223 --nnodes 1 --nproc_per_node 8 --rdzv-backend=c10d recipes/finetuning/finetuning.py --enable_fsdp --dataset alpaca_dataset --model_name meta-llama/Meta-Llama-3-8B --use_peft --peft_method lora --output_dir PEFT_model --max_train_step 2
W0523 10:30:35.612000 139639579272192 torch/distributed/run.py:757] 
W0523 10:30:35.612000 139639579272192 torch/distributed/run.py:757] *****************************************
W0523 10:30:35.612000 139639579272192 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0523 10:30:35.612000 139639579272192 torch/distributed/run.py:757] *****************************************
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.34s/it]
Loading checkpoint shards:  75%|██████████████████████████████████████████████████████████████████████████████████▌                           | 3/4 [00:12<00:04,  4.11s/it]Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
--> Model meta-llama/Meta-Llama-3-8B

--> meta-llama/Meta-Llama-3-8B has 8030.261248 Million params

trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
bFloat16 enabled for mixed precision - using bfSixteen policy
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.27s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.25s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.34s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.35s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.36s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.36s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.40s/it]
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
--> applying fsdp activation checkpointing...
Preprocessing dataset:   0%|▍                                                                                                         | 210/49402 [00:00<00:23, 2093.11it/s]--> applying fsdp activation checkpointing...
Preprocessing dataset:   1%|█                                                                                                         | 481/49402 [00:00<00:19, 2453.73it/s]--> applying fsdp activation checkpointing...
Preprocessing dataset:   0%|                                                                                                                      | 0/49402 [00:00<?, ?it/s]--> applying fsdp activation checkpointing...
Preprocessing dataset:   2%|█▌                                                                                                        | 747/49402 [00:00<00:19, 2547.50it/s]--> applying fsdp activation checkpointing...
Preprocessing dataset:   0%|                                                                                                                      | 0/49402 [00:00<?, ?it/s]--> applying fsdp activation checkpointing...
--> Training Set Length = 49402
Preprocessing dataset:   3%|██▋                                                                                                      | 1281/49402 [00:00<00:18, 2625.27it/s]--> Validation Set Length = 2600
Preprocessing dataset:   1%|█                                                                                                         | 487/49402 [00:00<00:19, 2479.20it/s]--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 49402/49402 [00:18<00:00, 2701.83it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 49402/49402 [00:18<00:00, 2724.37it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 49402/49402 [00:18<00:00, 2729.47it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 49402/49402 [00:18<00:00, 2664.44it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 49402/49402 [00:18<00:00, 2681.45it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 49402/49402 [00:18<00:00, 2702.83it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 49402/49402 [00:18<00:00, 2669.26it/s]
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2600/2600 [00:00<00:00, 2757.76it/s]
--> Num of Validation Set Batches loaded = 7
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2600/2600 [00:00<00:00, 2805.26it/s]
--> Num of Validation Set Batches loaded = 7
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 49402/49402 [00:18<00:00, 2646.96it/s]
Preprocessing dataset:  43%|█████████████████████████████████████████████▎                                                            | 1110/2600 [00:00<00:00, 2724.02it/s]/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/cuda/memory.py:330: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2600/2600 [00:00<00:00, 2804.79it/s]
--> Num of Validation Set Batches loaded = 7
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Preprocessing dataset:  53%|████████████████████████████████████████████████████████▍                                                 | 1383/2600 [00:00<00:00, 2695.70it/s]/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/cuda/memory.py:330: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
Training Epoch: 1:   0%|                                                                                                                             | 0/39 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Preprocessing dataset:  64%|███████████████████████████████████████████████████████████████████▍                                      | 1653/2600 [00:00<00:00, 2677.75it/s]/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/cuda/memory.py:330: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
Preprocessing dataset:  74%|██████████████████████████████████████████████████████████████████████████████▋                           | 1929/2600 [00:00<00:00, 2701.31it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2600/2600 [00:00<00:00, 2733.80it/s]
--> Num of Validation Set Batches loaded = 7
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2600/2600 [00:00<00:00, 2745.22it/s]
--> Num of Validation Set Batches loaded = 7
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2600/2600 [00:00<00:00, 2729.47it/s]
--> Num of Validation Set Batches loaded = 7
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2600/2600 [00:00<00:00, 2703.83it/s]
--> Num of Validation Set Batches loaded = 7
/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/cuda/memory.py:330: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
Preprocessing dataset:  72%|████████████████████████████████████████████████████████████████████████████▍                             | 1876/2600 [00:00<00:00, 2641.07it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Preprocessing dataset:  83%|███████████████████████████████████████████████████████████████████████████████████████▌                  | 2147/2600 [00:00<00:00, 2660.76it/s]/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/cuda/memory.py:330: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
Training Epoch: 1:   0%|                                                                                                                             | 0/39 [00:00<?, ?it/s]/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/cuda/memory.py:330: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
Training Epoch: 1:   0%|                                                                                                                             | 0/39 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/cuda/memory.py:330: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Preprocessing dataset:  93%|██████████████████████████████████████████████████████████████████████████████████████████████████▊       | 2423/2600 [00:00<00:00, 2689.39it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2600/2600 [00:00<00:00, 2672.48it/s]
--> Num of Validation Set Batches loaded = 7
NCCL version 2.20.5+cuda12.4
/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/cuda/memory.py:330: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
Training Epoch: 1:   0%|                                                                                                                             | 0/39 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Training Epoch: 1/3, step 1/39 completed (loss: 1.6738725900650024):   5%|███▍                                                               | 2/39 [00:16<04:29,  7.28s/it]max training steps reached, stopping training, total train steps finished:  2
Training Epoch: 1/3, step 1/39 completed (loss: 1.6738725900650024):   5%|███▍                                                               | 2/39 [00:16<05:08,  8.33s/it]
Training Epoch: 1/3, step 1/39 completed (loss: 1.5186713933944702):   5%|███▍                                                               | 2/39 [00:17<05:22,  8.72s/it]
Training Epoch: 1/3, step 1/39 completed (loss: 1.543288230895996):   5%|███▍                                                                | 2/39 [00:16<05:09,  8.37s/it]
Training Epoch: 1/3, step 1/39 completed (loss: 1.486218810081482):   5%|███▍                                                                | 2/39 [00:16<05:12,  8.43s/it]
Training Epoch: 1/3, step 1/39 completed (loss: 1.473335862159729):   5%|███▍                                                                | 2/39 [00:16<05:10,  8.38s/it]
Training Epoch: 1/3, step 1/39 completed (loss: 1.4909212589263916):   5%|███▍                                                               | 2/39 [00:17<05:20,  8.65s/it]
Training Epoch: 1/3, step 1/39 completed (loss: 1.5622209310531616):   5%|███▍                                                               | 2/39 [00:16<04:59,  8.09s/it]
Training Epoch: 1/3, step 1/39 completed (loss: 1.5295921564102173):   5%|███▍                                                               | 2/39 [00:17<05:25,  8.80s/it]
Max CUDA memory allocated was 34 GB
Max CUDA memory reserved was 41 GB
Peak active CUDA memory was 34 GB
CUDA Malloc retries : 0
CPU Total Peak Memory consumed during the train (max): 9 GB
evaluating Epoch:   0%|                                                                                                                               | 0/7 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
evaluating Epoch:   0%|                                                                                                                               | 0/7 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
evaluating Epoch: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  3.16it/s]
evaluating Epoch: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  3.16it/s]
evaluating Epoch: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  3.18it/s]
evaluating Epoch: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  3.22it/s]
evaluating Epoch: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  3.15it/s]
evaluating Epoch: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  3.22it/s]
evaluating Epoch: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  3.22it/s]
evaluating Epoch: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  3.14it/s]
 eval_ppl=tensor(4.4443, device='cuda:0') eval_epoch_loss=tensor(1.4916, device='cuda:0')
we are about to save the PEFT modules
PEFT modules are saved in PEFT_model directory
best eval loss on epoch 1 is 1.4916136264801025
Epoch 1: train_perplexity=1.0835, train_epoch_loss=0.0802, epoch time 17.234853900037706s
Key: avg_train_prep, Value: 1.0834850072860718
Key: avg_train_loss, Value: 0.08018267154693604
Key: avg_eval_prep, Value: 4.444261074066162
Key: avg_eval_loss, Value: 1.4916136264801025
Key: avg_epoch_time, Value: 17.234853900037706
Key: avg_checkpoint_time, Value: 0.6513841110281646
```
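
As a quick sanity check (not part of the PR), the set lengths reported in the log above are consistent with a 5% eval split:

```python
# Numbers taken from the log above; this only verifies the split arithmetic.
train_len, eval_len = 49402, 2600
total = train_len + eval_len   # 52002 examples in total
print(eval_len / total)        # ~0.05, i.e. 5% held out for evaluation
print(int(total / 20))         # 2600, the reported Validation Set Length
```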
  • Manually caused len(eval_loader) == 0 (the log below shows the validation set shrunk back to a fixed 200 examples); the expected error is raised:
```
~/work/llama-recipes (fix/eval_dataloader_not_loaded)]$ torchrun --rdzv-endpoint=localhost:0 --rdzv-id=111223 --nnodes 1 --nproc_per_node 8 --rdzv-backend=c10d recipes/finetuning/finetuning.py --enable_fsdp --dataset alpaca_dataset --model_name meta-llama/Meta-Llama-3-8B --use_peft --peft_method lora --output_dir PEFT_model --max_train_step 2
W0523 10:40:23.991000 139842235315200 torch/distributed/run.py:757] 
W0523 10:40:23.991000 139842235315200 torch/distributed/run.py:757] *****************************************
W0523 10:40:23.991000 139842235315200 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0523 10:40:23.991000 139842235315200 torch/distributed/run.py:757] *****************************************
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.41s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:12<00:00,  3.20s/it]
Loading checkpoint shards:  75%|██████████████████████████████████████████████████████████████████████████████████▌                           | 3/4 [00:12<00:04,  4.31s/it]Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
--> Model meta-llama/Meta-Llama-3-8B

--> meta-llama/Meta-Llama-3-8B has 8030.261248 Million params

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
bFloat16 enabled for mixed precision - using bfSixteen policy
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.46s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.49s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:14<00:00,  3.50s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.48s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:14<00:00,  3.55s/it]
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:14<00:00,  3.56s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
--> applying fsdp activation checkpointing...
Preprocessing dataset:   2%|██                                                                                                       | 1038/51802 [00:00<00:19, 2666.96it/s]--> applying fsdp activation checkpointing...
Preprocessing dataset:   3%|██▋                                                                                                      | 1316/51802 [00:00<00:18, 2703.66it/s]--> applying fsdp activation checkpointing...
Preprocessing dataset:   3%|███▏                                                                                                     | 1587/51802 [00:00<00:18, 2687.06it/s]--> Training Set Length = 51802
Preprocessing dataset:   4%|███▊                                                                                                     | 1863/51802 [00:00<00:18, 2710.21it/s]--> applying fsdp activation checkpointing...
--> Validation Set Length = 200
Preprocessing dataset:   0%|                                                                                                                      | 0/51802 [00:00<?, ?it/s]--> applying fsdp activation checkpointing...
Preprocessing dataset:   4%|████▎                                                                                                    | 2152/51802 [00:00<00:17, 2765.31it/s]--> applying fsdp activation checkpointing...
Preprocessing dataset:   1%|▉                                                                                                         | 485/51802 [00:00<00:20, 2476.53it/s]--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 51802/51802 [00:19<00:00, 2657.61it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 2701.83it/s]
[rank7]: Traceback (most recent call last):
[rank7]:   File "/home/kaiwu/work/llama-recipes/recipes/finetuning/finetuning.py", line 8, in <module>
[rank7]:     fire.Fire(main)
[rank7]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
[rank7]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank7]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
[rank7]:     component, remaining_args = _CallAndUpdateTrace(
[rank7]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank7]:     component = fn(*varargs, **kwargs)
[rank7]:   File "/home/kaiwu/work/llama-recipes/src/llama_recipes/finetuning.py", line 254, in main
[rank7]:     raise ValueError("The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.")
[rank7]: ValueError: The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 51802/51802 [00:19<00:00, 2690.56it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 51802/51802 [00:19<00:00, 2684.25it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 2745.95it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/kaiwu/work/llama-recipes/recipes/finetuning/finetuning.py", line 8, in <module>
[rank0]:     fire.Fire(main)
[rank0]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
[rank0]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
[rank0]:     component, remaining_args = _CallAndUpdateTrace(
[rank0]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank0]:     component = fn(*varargs, **kwargs)
[rank0]:   File "/home/kaiwu/work/llama-recipes/src/llama_recipes/finetuning.py", line 254, in main
[rank0]:     raise ValueError("The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.")
[rank0]: ValueError: The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 51802/51802 [00:18<00:00, 2732.65it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 51802/51802 [00:19<00:00, 2696.89it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 51802/51802 [00:19<00:00, 2704.91it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 51802/51802 [00:19<00:00, 2714.17it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 2705.17it/s]
[rank5]: Traceback (most recent call last):
[rank5]:   File "/home/kaiwu/work/llama-recipes/recipes/finetuning/finetuning.py", line 8, in <module>
[rank5]:     fire.Fire(main)
[rank5]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
[rank5]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank5]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
[rank5]:     component, remaining_args = _CallAndUpdateTrace(
[rank5]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank5]:     component = fn(*varargs, **kwargs)
[rank5]:   File "/home/kaiwu/work/llama-recipes/src/llama_recipes/finetuning.py", line 254, in main
[rank5]:     raise ValueError("The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.")
[rank5]: ValueError: The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 2800.42it/s]
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/kaiwu/work/llama-recipes/recipes/finetuning/finetuning.py", line 8, in <module>
[rank3]:     fire.Fire(main)
[rank3]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
[rank3]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank3]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
[rank3]:     component, remaining_args = _CallAndUpdateTrace(
[rank3]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank3]:     component = fn(*varargs, **kwargs)
[rank3]:   File "/home/kaiwu/work/llama-recipes/src/llama_recipes/finetuning.py", line 254, in main
[rank3]:     raise ValueError("The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.")
[rank3]: ValueError: The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 2761.45it/s]
[rank6]: Traceback (most recent call last):
[rank6]:   File "/home/kaiwu/work/llama-recipes/recipes/finetuning/finetuning.py", line 8, in <module>
[rank6]:     fire.Fire(main)
[rank6]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
[rank6]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank6]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
[rank6]:     component, remaining_args = _CallAndUpdateTrace(
[rank6]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank6]:     component = fn(*varargs, **kwargs)
[rank6]:   File "/home/kaiwu/work/llama-recipes/src/llama_recipes/finetuning.py", line 254, in main
[rank6]:     raise ValueError("The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.")
[rank6]: ValueError: The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 2770.14it/s]
[rank4]: Traceback (most recent call last):
[rank4]:   File "/home/kaiwu/work/llama-recipes/recipes/finetuning/finetuning.py", line 8, in <module>
[rank4]:     fire.Fire(main)
[rank4]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
[rank4]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank4]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
[rank4]:     component, remaining_args = _CallAndUpdateTrace(
[rank4]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank4]:     component = fn(*varargs, **kwargs)
[rank4]:   File "/home/kaiwu/work/llama-recipes/src/llama_recipes/finetuning.py", line 254, in main
[rank4]:     raise ValueError("The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.")
[rank4]: ValueError: The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 2771.16it/s]
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/kaiwu/work/llama-recipes/recipes/finetuning/finetuning.py", line 8, in <module>
[rank1]:     fire.Fire(main)
[rank1]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
[rank1]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
[rank1]:     component, remaining_args = _CallAndUpdateTrace(
[rank1]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank1]:     component = fn(*varargs, **kwargs)
[rank1]:   File "/home/kaiwu/work/llama-recipes/src/llama_recipes/finetuning.py", line 254, in main
[rank1]:     raise ValueError("The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.")
[rank1]: ValueError: The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 51802/51802 [00:19<00:00, 2712.80it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 2761.09it/s]
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/kaiwu/work/llama-recipes/recipes/finetuning/finetuning.py", line 8, in <module>
[rank2]:     fire.Fire(main)
[rank2]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
[rank2]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank2]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
[rank2]:     component, remaining_args = _CallAndUpdateTrace(
[rank2]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank2]:     component = fn(*varargs, **kwargs)
[rank2]:   File "/home/kaiwu/work/llama-recipes/src/llama_recipes/finetuning.py", line 254, in main
[rank2]:     raise ValueError("The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.")
[rank2]: ValueError: The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.
E0523 10:41:14.260000 139842235315200 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 280537) of binary: /home/kaiwu/miniconda3/envs/llama/bin/python
```
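
The per-rank tracebacks above all come from the new check in src/llama_recipes/finetuning.py. A minimal sketch of that guard follows; the wrapper function is illustrative, while the error message is verbatim from the traceback and the success-path print matches the "Num of Validation Set Batches loaded" lines in the first log:

```python
def ensure_eval_loader_nonempty(eval_dataloader, run_validation: bool) -> None:
    # Sketch of the check added by this PR; only the length test and the
    # error message are confirmed by the traceback above.
    if run_validation:
        if len(eval_dataloader) == 0:
            raise ValueError(
                "The eval set size is too small for dataloader to load even one batch. "
                "Please increase the size of eval set."
            )
        print(f"--> Num of Validation Set Batches loaded = {len(eval_dataloader)}")
```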

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Thanks for contributing 🎉!

@wukaixingxp self-assigned this on May 23, 2024
@wukaixingxp changed the title from "fix edalpaca dataset evalset length and make sure len(eval_loader)>0" to "fixed alpaca dataset evalset length and make sure len(eval_loader)>0" on May 23, 2024
@mreso (Contributor) left a comment:

Looks good, could you address the comment regarding the collective's behavior?

Review thread on src/llama_recipes/finetuning.py (resolved)
@wukaixingxp requested a review from @mreso on May 23, 2024, 18:46
@mreso (Contributor) left a comment:

LGTM

@mreso merged commit 41a46d8 into main on May 29, 2024
3 checks passed
@wukaixingxp deleted the fix/eval_dataloader_not_loaded branch on July 25, 2024, 18:30