fixed alpaca dataset evalset length and make sure len(eval_loader)>0 #540

Merged: 1 commit merged into main on May 29, 2024

Conversation

@wukaixingxp (Contributor) commented on May 23, 2024

What does this PR do?

Fixed the alpaca dataset eval-set length by using 5% of the dataset as the eval set, so that len(eval_dataloader) > 0, meaning at least one batch can be loaded by the dataloader. Also added a check that len(eval_dataloader) > 0 when run_validation=True; otherwise an error is raised and training stops.

This problem was reported in issue #520.
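
For reference, here is a minimal sketch of the dataset-side change, assuming the split lives in the alpaca dataset loader; the function and variable names below are illustrative, and only the 5% fraction comes from the PR description:

```python
# Sketch: hold out 5% of the alpaca annotations for evaluation instead of a
# small fixed-size slice, so the eval dataloader can always form at least
# one batch. Names here are illustrative, not the actual diff.
def split_annotations(ann: list, partition: str) -> list:
    eval_length = int(len(ann) / 20)  # 5% of the dataset
    if partition == "train":
        return ann[eval_length:]
    return ann[:eval_length]
```

With the 52,002-example alpaca dataset this yields the 2,600 validation examples seen in the first test log below.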

Feature/Issue validation/testing

Please describe the tests that you ran to verify your changes and summarize the relevant results. Provide instructions so the tests can be reproduced. Please also list any relevant details of your test configuration.

  • Evaluation step works in the alpaca fine-tuning run (see the log below):
```
~/work/llama-recipes (fix/eval_dataloader_not_loaded)]$ torchrun --rdzv-endpoint=localhost:0 --rdzv-id=111223 --nnodes 1 --nproc_per_node 8 --rdzv-backend=c10d recipes/finetuning/finetuning.py --enable_fsdp --dataset alpaca_dataset --model_name meta-llama/Meta-Llama-3-8B --use_peft --peft_method lora --output_dir PEFT_model --max_train_step 2
W0523 10:30:35.612000 139639579272192 torch/distributed/run.py:757] 
W0523 10:30:35.612000 139639579272192 torch/distributed/run.py:757] *****************************************
W0523 10:30:35.612000 139639579272192 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0523 10:30:35.612000 139639579272192 torch/distributed/run.py:757] *****************************************
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.34s/it]
Loading checkpoint shards:  75%|██████████████████████████████████████████████████████████████████████████████████▌                           | 3/4 [00:12<00:04,  4.11s/it]Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
--> Model meta-llama/Meta-Llama-3-8B

--> meta-llama/Meta-Llama-3-8B has 8030.261248 Million params

trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
bFloat16 enabled for mixed precision - using bfSixteen policy
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.27s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.25s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.34s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.35s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.36s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.36s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.40s/it]
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
--> applying fsdp activation checkpointing...
Preprocessing dataset:   0%|▍                                                                                                         | 210/49402 [00:00<00:23, 2093.11it/s]--> applying fsdp activation checkpointing...
Preprocessing dataset:   1%|█                                                                                                         | 481/49402 [00:00<00:19, 2453.73it/s]--> applying fsdp activation checkpointing...
Preprocessing dataset:   0%|                                                                                                                      | 0/49402 [00:00<?, ?it/s]--> applying fsdp activation checkpointing...
Preprocessing dataset:   2%|█▌                                                                                                        | 747/49402 [00:00<00:19, 2547.50it/s]--> applying fsdp activation checkpointing...
Preprocessing dataset:   0%|                                                                                                                      | 0/49402 [00:00<?, ?it/s]--> applying fsdp activation checkpointing...
--> Training Set Length = 49402
Preprocessing dataset:   3%|██▋                                                                                                      | 1281/49402 [00:00<00:18, 2625.27it/s]--> Validation Set Length = 2600
Preprocessing dataset:   1%|█                                                                                                         | 487/49402 [00:00<00:19, 2479.20it/s]--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 49402/49402 [00:18<00:00, 2701.83it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 49402/49402 [00:18<00:00, 2724.37it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 49402/49402 [00:18<00:00, 2729.47it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 49402/49402 [00:18<00:00, 2664.44it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 49402/49402 [00:18<00:00, 2681.45it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 49402/49402 [00:18<00:00, 2702.83it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 49402/49402 [00:18<00:00, 2669.26it/s]
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2600/2600 [00:00<00:00, 2757.76it/s]
--> Num of Validation Set Batches loaded = 7
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2600/2600 [00:00<00:00, 2805.26it/s]
--> Num of Validation Set Batches loaded = 7
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 49402/49402 [00:18<00:00, 2646.96it/s]
Preprocessing dataset:  43%|█████████████████████████████████████████████▎                                                            | 1110/2600 [00:00<00:00, 2724.02it/s]/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/cuda/memory.py:330: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2600/2600 [00:00<00:00, 2804.79it/s]
--> Num of Validation Set Batches loaded = 7
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Preprocessing dataset:  53%|████████████████████████████████████████████████████████▍                                                 | 1383/2600 [00:00<00:00, 2695.70it/s]/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/cuda/memory.py:330: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
Training Epoch: 1:   0%|                                                                                                                             | 0/39 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Preprocessing dataset:  64%|███████████████████████████████████████████████████████████████████▍                                      | 1653/2600 [00:00<00:00, 2677.75it/s]/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/cuda/memory.py:330: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
Preprocessing dataset:  74%|██████████████████████████████████████████████████████████████████████████████▋                           | 1929/2600 [00:00<00:00, 2701.31it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2600/2600 [00:00<00:00, 2733.80it/s]
--> Num of Validation Set Batches loaded = 7
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2600/2600 [00:00<00:00, 2745.22it/s]
--> Num of Validation Set Batches loaded = 7
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2600/2600 [00:00<00:00, 2729.47it/s]
--> Num of Validation Set Batches loaded = 7
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2600/2600 [00:00<00:00, 2703.83it/s]
--> Num of Validation Set Batches loaded = 7
/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/cuda/memory.py:330: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
Preprocessing dataset:  72%|████████████████████████████████████████████████████████████████████████████▍                             | 1876/2600 [00:00<00:00, 2641.07it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Preprocessing dataset:  83%|███████████████████████████████████████████████████████████████████████████████████████▌                  | 2147/2600 [00:00<00:00, 2660.76it/s]/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/cuda/memory.py:330: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
Training Epoch: 1:   0%|                                                                                                                             | 0/39 [00:00<?, ?it/s]/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/cuda/memory.py:330: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
Training Epoch: 1:   0%|                                                                                                                             | 0/39 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/cuda/memory.py:330: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Preprocessing dataset:  93%|██████████████████████████████████████████████████████████████████████████████████████████████████▊       | 2423/2600 [00:00<00:00, 2689.39it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Preprocessing dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2600/2600 [00:00<00:00, 2672.48it/s]
--> Num of Validation Set Batches loaded = 7
NCCL version 2.20.5+cuda12.4
/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/cuda/memory.py:330: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
Training Epoch: 1:   0%|                                                                                                                             | 0/39 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Training Epoch: 1/3, step 1/39 completed (loss: 1.6738725900650024):   5%|███▍                                                               | 2/39 [00:16<04:29,  7.28s/it]max training steps reached, stopping training, total train steps finished:  2
Training Epoch: 1/3, step 1/39 completed (loss: 1.6738725900650024):   5%|███▍                                                               | 2/39 [00:16<05:08,  8.33s/it]
Training Epoch: 1/3, step 1/39 completed (loss: 1.5186713933944702):   5%|███▍                                                               | 2/39 [00:17<05:22,  8.72s/it]
Training Epoch: 1/3, step 1/39 completed (loss: 1.543288230895996):   5%|███▍                                                                | 2/39 [00:16<05:09,  8.37s/it]
Training Epoch: 1/3, step 1/39 completed (loss: 1.486218810081482):   5%|███▍                                                                | 2/39 [00:16<05:12,  8.43s/it]
Training Epoch: 1/3, step 1/39 completed (loss: 1.473335862159729):   5%|███▍                                                                | 2/39 [00:16<05:10,  8.38s/it]
Training Epoch: 1/3, step 1/39 completed (loss: 1.4909212589263916):   5%|███▍                                                               | 2/39 [00:17<05:20,  8.65s/it]
Training Epoch: 1/3, step 1/39 completed (loss: 1.5622209310531616):   5%|███▍                                                               | 2/39 [00:16<04:59,  8.09s/it]
Training Epoch: 1/3, step 1/39 completed (loss: 1.5295921564102173):   5%|███▍                                                               | 2/39 [00:17<05:25,  8.80s/it]
Max CUDA memory allocated was 34 GB
Max CUDA memory reserved was 41 GB
Peak active CUDA memory was 34 GB
CUDA Malloc retries : 0
CPU Total Peak Memory consumed during the train (max): 9 GB
evaluating Epoch:   0%|                                                                                                                               | 0/7 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
evaluating Epoch:   0%|                                                                                                                               | 0/7 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
evaluating Epoch: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  3.16it/s]
evaluating Epoch: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  3.16it/s]
evaluating Epoch: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  3.18it/s]
evaluating Epoch: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  3.22it/s]
evaluating Epoch: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  3.15it/s]
evaluating Epoch: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  3.22it/s]
evaluating Epoch: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  3.22it/s]
evaluating Epoch: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  3.14it/s]
 eval_ppl=tensor(4.4443, device='cuda:0') eval_epoch_loss=tensor(1.4916, device='cuda:0')
we are about to save the PEFT modules
PEFT modules are saved in PEFT_model directory
best eval loss on epoch 1 is 1.4916136264801025
Epoch 1: train_perplexity=1.0835, train_epoch_loss=0.0802, epoch time 17.234853900037706s
Key: avg_train_prep, Value: 1.0834850072860718
Key: avg_train_loss, Value: 0.08018267154693604
Key: avg_eval_prep, Value: 4.444261074066162
Key: avg_eval_loss, Value: 1.4916136264801025
Key: avg_epoch_time, Value: 17.234853900037706
Key: avg_checkpoint_time, Value: 0.6513841110281646
```
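
As a quick sanity check (not part of the PR), the set lengths reported in the log above are consistent with a 5% eval split:

```python
# Numbers taken from the log above; this only verifies the split arithmetic.
train_len, eval_len = 49402, 2600
total = train_len + eval_len   # 52002 examples in total
print(eval_len / total)        # ~0.05, i.e. 5% held out for evaluation
print(int(total / 20))         # 2600, the reported Validation Set Length
```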
  • Manually caused len(eval_loader) == 0 (the log below shows the validation set shrunk back to a fixed 200 examples); the expected error is raised:
```
~/work/llama-recipes (fix/eval_dataloader_not_loaded)]$ torchrun --rdzv-endpoint=localhost:0 --rdzv-id=111223 --nnodes 1 --nproc_per_node 8 --rdzv-backend=c10d recipes/finetuning/finetuning.py --enable_fsdp --dataset alpaca_dataset --model_name meta-llama/Meta-Llama-3-8B --use_peft --peft_method lora --output_dir PEFT_model --max_train_step 2
W0523 10:40:23.991000 139842235315200 torch/distributed/run.py:757] 
W0523 10:40:23.991000 139842235315200 torch/distributed/run.py:757] *****************************************
W0523 10:40:23.991000 139842235315200 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0523 10:40:23.991000 139842235315200 torch/distributed/run.py:757] *****************************************
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.41s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:12<00:00,  3.20s/it]
Loading checkpoint shards:  75%|██████████████████████████████████████████████████████████████████████████████████▌                           | 3/4 [00:12<00:04,  4.31s/it]Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
--> Model meta-llama/Meta-Llama-3-8B

--> meta-llama/Meta-Llama-3-8B has 8030.261248 Million params

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
bFloat16 enabled for mixed precision - using bfSixteen policy
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.46s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.49s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:14<00:00,  3.50s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.48s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:14<00:00,  3.55s/it]
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:14<00:00,  3.56s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.04241987003816259
--> applying fsdp activation checkpointing...
Preprocessing dataset:   2%|██                                                                                                       | 1038/51802 [00:00<00:19, 2666.96it/s]--> applying fsdp activation checkpointing...
Preprocessing dataset:   3%|██▋                                                                                                      | 1316/51802 [00:00<00:18, 2703.66it/s]--> applying fsdp activation checkpointing...
Preprocessing dataset:   3%|███▏                                                                                                     | 1587/51802 [00:00<00:18, 2687.06it/s]--> Training Set Length = 51802
Preprocessing dataset:   4%|███▊                                                                                                     | 1863/51802 [00:00<00:18, 2710.21it/s]--> applying fsdp activation checkpointing...
--> Validation Set Length = 200
Preprocessing dataset:   0%|                                                                                                                      | 0/51802 [00:00<?, ?it/s]--> applying fsdp activation checkpointing...
Preprocessing dataset:   4%|████▎                                                                                                    | 2152/51802 [00:00<00:17, 2765.31it/s]--> applying fsdp activation checkpointing...
Preprocessing dataset:   1%|▉                                                                                                         | 485/51802 [00:00<00:20, 2476.53it/s]--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 51802/51802 [00:19<00:00, 2657.61it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 2701.83it/s]
[rank7]: Traceback (most recent call last):
[rank7]:   File "/home/kaiwu/work/llama-recipes/recipes/finetuning/finetuning.py", line 8, in <module>
[rank7]:     fire.Fire(main)
[rank7]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
[rank7]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank7]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
[rank7]:     component, remaining_args = _CallAndUpdateTrace(
[rank7]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank7]:     component = fn(*varargs, **kwargs)
[rank7]:   File "/home/kaiwu/work/llama-recipes/src/llama_recipes/finetuning.py", line 254, in main
[rank7]:     raise ValueError("The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.")
[rank7]: ValueError: The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 51802/51802 [00:19<00:00, 2690.56it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 51802/51802 [00:19<00:00, 2684.25it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 2745.95it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/kaiwu/work/llama-recipes/recipes/finetuning/finetuning.py", line 8, in <module>
[rank0]:     fire.Fire(main)
[rank0]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
[rank0]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
[rank0]:     component, remaining_args = _CallAndUpdateTrace(
[rank0]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank0]:     component = fn(*varargs, **kwargs)
[rank0]:   File "/home/kaiwu/work/llama-recipes/src/llama_recipes/finetuning.py", line 254, in main
[rank0]:     raise ValueError("The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.")
[rank0]: ValueError: The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 51802/51802 [00:18<00:00, 2732.65it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 51802/51802 [00:19<00:00, 2696.89it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 51802/51802 [00:19<00:00, 2704.91it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 51802/51802 [00:19<00:00, 2714.17it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 2705.17it/s]
[rank5]: Traceback (most recent call last):
[rank5]:   File "/home/kaiwu/work/llama-recipes/recipes/finetuning/finetuning.py", line 8, in <module>
[rank5]:     fire.Fire(main)
[rank5]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
[rank5]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank5]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
[rank5]:     component, remaining_args = _CallAndUpdateTrace(
[rank5]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank5]:     component = fn(*varargs, **kwargs)
[rank5]:   File "/home/kaiwu/work/llama-recipes/src/llama_recipes/finetuning.py", line 254, in main
[rank5]:     raise ValueError("The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.")
[rank5]: ValueError: The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 2800.42it/s]
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/kaiwu/work/llama-recipes/recipes/finetuning/finetuning.py", line 8, in <module>
[rank3]:     fire.Fire(main)
[rank3]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
[rank3]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank3]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
[rank3]:     component, remaining_args = _CallAndUpdateTrace(
[rank3]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank3]:     component = fn(*varargs, **kwargs)
[rank3]:   File "/home/kaiwu/work/llama-recipes/src/llama_recipes/finetuning.py", line 254, in main
[rank3]:     raise ValueError("The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.")
[rank3]: ValueError: The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 2761.45it/s]
[rank6]: Traceback (most recent call last):
[rank6]:   File "/home/kaiwu/work/llama-recipes/recipes/finetuning/finetuning.py", line 8, in <module>
[rank6]:     fire.Fire(main)
[rank6]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
[rank6]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank6]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
[rank6]:     component, remaining_args = _CallAndUpdateTrace(
[rank6]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank6]:     component = fn(*varargs, **kwargs)
[rank6]:   File "/home/kaiwu/work/llama-recipes/src/llama_recipes/finetuning.py", line 254, in main
[rank6]:     raise ValueError("The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.")
[rank6]: ValueError: The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 2770.14it/s]
[rank4]: Traceback (most recent call last):
[rank4]:   File "/home/kaiwu/work/llama-recipes/recipes/finetuning/finetuning.py", line 8, in <module>
[rank4]:     fire.Fire(main)
[rank4]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
[rank4]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank4]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
[rank4]:     component, remaining_args = _CallAndUpdateTrace(
[rank4]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank4]:     component = fn(*varargs, **kwargs)
[rank4]:   File "/home/kaiwu/work/llama-recipes/src/llama_recipes/finetuning.py", line 254, in main
[rank4]:     raise ValueError("The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.")
[rank4]: ValueError: The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 2771.16it/s]
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/kaiwu/work/llama-recipes/recipes/finetuning/finetuning.py", line 8, in <module>
[rank1]:     fire.Fire(main)
[rank1]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
[rank1]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
[rank1]:     component, remaining_args = _CallAndUpdateTrace(
[rank1]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank1]:     component = fn(*varargs, **kwargs)
[rank1]:   File "/home/kaiwu/work/llama-recipes/src/llama_recipes/finetuning.py", line 254, in main
[rank1]:     raise ValueError("The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.")
[rank1]: ValueError: The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 51802/51802 [00:19<00:00, 2712.80it/s]
Preprocessing dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 2761.09it/s]
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/kaiwu/work/llama-recipes/recipes/finetuning/finetuning.py", line 8, in <module>
[rank2]:     fire.Fire(main)
[rank2]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
[rank2]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank2]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
[rank2]:     component, remaining_args = _CallAndUpdateTrace(
[rank2]:   File "/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank2]:     component = fn(*varargs, **kwargs)
[rank2]:   File "/home/kaiwu/work/llama-recipes/src/llama_recipes/finetuning.py", line 254, in main
[rank2]:     raise ValueError("The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.")
[rank2]: ValueError: The eval set size is too small for dataloader to load even one batch. Please increase the size of eval set.
E0523 10:41:14.260000 139842235315200 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 280537) of binary: /home/kaiwu/miniconda3/envs/llama/bin/python
```
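
The per-rank tracebacks above all come from the new check in src/llama_recipes/finetuning.py. A minimal sketch of that guard follows; the wrapper function is illustrative, while the error message is verbatim from the traceback and the success-path print matches the "Num of Validation Set Batches loaded" lines in the first log:

```python
def ensure_eval_loader_nonempty(eval_dataloader, run_validation: bool) -> None:
    # Sketch of the check added by this PR; only the length test and the
    # error message are confirmed by the traceback above.
    if run_validation:
        if len(eval_dataloader) == 0:
            raise ValueError(
                "The eval set size is too small for dataloader to load even one batch. "
                "Please increase the size of eval set."
            )
        print(f"--> Num of Validation Set Batches loaded = {len(eval_dataloader)}")
```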

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Thanks for contributing 🎉!

@wukaixingxp self-assigned this on May 23, 2024
@wukaixingxp changed the title from "fix edalpaca dataset evalset length and make sure len(eval_loader)>0" to "fixed alpaca dataset evalset length and make sure len(eval_loader)>0" on May 23, 2024
@mreso (Contributor) left a comment:

Looks good, could you address the comment regarding the collective's behavior?

Review thread on src/llama_recipes/finetuning.py (resolved)
@wukaixingxp requested a review from @mreso on May 23, 2024, 18:46
@mreso (Contributor) left a comment:

LGTM

@mreso merged commit 41a46d8 into main on May 29, 2024
3 checks passed
@wukaixingxp deleted the fix/eval_dataloader_not_loaded branch on July 25, 2024, 18:30