run finetuning.py error: TypeError: Invalid function argument. Expected parameter `tensor` of type torch.Tensor, but got <class 'float'> instead. #520
Comments
Hi! I cannot reproduce the error; on our H100 machine, I can run it.
Thank you for your response, but the same error still occurs after running.
Hi! I noticed that your model is also different; can you try with this command?
Just the original "meta-llama/Meta-Llama-3-8B-Instruct" or "meta-llama/Meta-Llama-3-8B" cannot be used; both fail with the message "Meta-Llama-3-8B does not appear to have a file named config.json".

Output of python -m torch.utils.collect_env:

/ssd/llm_chinahpc/software/anaconda3_2024.02/envs/llama3-recipes/lib/python3.10/runpy.py:126: RuntimeWarning: 'torch.utils.collect_env' found in sys.modules after import of package 'torch.utils', but prior to execution of 'torch.utils.collect_env'; this may result in unpredictable behaviour
OS: Ubuntu 22.04.3 LTS (x86_64)
Python version: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:24:10) [GCC 9.4.0] (64-bit runtime)
Nvidia driver version: 545.23.08
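As a quick sanity check (illustrative only, not from the thread): if --model_name points at a local directory, you can verify it actually contains a Hugging Face-format config before launching, since a missing file reproduces this loading error:

```python
import os

model_path = "Llama3/Meta-Llama-3-8B-Instruct-hg"  # hypothetical local checkpoint path
print(os.path.exists(os.path.join(model_path, "config.json")))
```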
I encountered the "8B does not appear to have a file named config.json" error before, and I think I solved it with a complete reinstall on the latest main.
I tried it, but the same error occurred again at the beginning. [rank7]: Traceback (most recent call last):
I noticed that the error comes from the evaluation function, and I can reproduce your error now. The problem is that the eval for loop is never entered, because len(eval_dataloader) = 0. Taking a closer look, the length of dataset_val becomes 4 after ConcatDataset(), which is too small for even one eval step on 8 GPUs. The temporary solution is to change the eval set length to a bigger number like 1000 instead of 200 here; remember to change it on both line 30 and line 32 (see the sketches below). I will talk to the team about how we can prevent this by adding some warning system.
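For context, here is a minimal sketch of how the empty eval dataloader turns into the TypeError in the logs (the real evaluation in train_utils.py takes more arguments; the names here are illustrative):

```python
import torch
import torch.distributed as dist

def evaluation(model, eval_dataloader, local_rank):
    eval_loss = 0.0  # starts out as a Python float
    for batch in eval_dataloader:  # never entered when len(eval_dataloader) == 0
        outputs = model(**batch)
        eval_loss += outputs.loss.detach().float()  # would turn eval_loss into a torch.Tensor
    # With an empty dataloader, eval_loss is still a float, and
    # dist.all_reduce only accepts tensors, hence the TypeError.
    dist.all_reduce(eval_loss, op=dist.ReduceOp.SUM)
```

And a hedged sketch of the suggested workaround, assuming the alpaca dataset splits train/eval by slicing at a fixed index (check lines 30 and 32 in your checkout; the variable names may differ):

```python
if partition == "train":
    self.ann = self.ann[1000:]   # was self.ann[200:]
else:
    self.ann = self.ann[:1000]   # was self.ann[:200]
```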
I got it, thank you very much!
System Info
pip list | grep -i -E 'cuda|torch'
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
torch 2.3.0
GPU info: 8 x H100 SXM 80G
🐛 Describe the bug
cmdline:
torchrun --nnodes 1 --nproc_per_node 8 --rdzv-id=111223 --rdzv-backend=c10d --rdzv-endpoint=10.0.1.3:12341 recipes/finetuning/finetuning.py --enable_fsdp --dataset alpaca_dataset --model_name Llama3/Meta-Llama-3-8B-Instruct-hg --use_peft --peft_method lora --output_dir PEFT_model
Running it fails with the error below.
Error logs
Some of the error output:
[rank6]: Traceback (most recent call last):
[rank6]: File "/ssd/llm_chinahpc/Llama3/llama-recipes/recipes/finetuning/finetuning.py", line 8, in
[rank6]: fire.Fire(main)
[rank6]: File "/ssd/llm_chinahpc/software/anaconda3_2024.02/envs/llama3-recipes/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
[rank6]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank6]: File "/ssd/llm_chinahpc/software/anaconda3_2024.02/envs/llama3-recipes/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
[rank6]: component, remaining_args = _CallAndUpdateTrace(
[rank6]: File "/ssd/llm_chinahpc/software/anaconda3_2024.02/envs/llama3-recipes/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank6]: component = fn(*varargs, **kwargs)
[rank6]: File "/ssd/llm_chinahpc/Llama3/llama-recipes/src/llama_recipes/finetuning.py", line 268, in main
[rank6]: results = train(
[rank6]: File "/ssd/llm_chinahpc/Llama3/llama-recipes/src/llama_recipes/utils/train_utils.py", line 224, in train
[rank6]: eval_ppl, eval_epoch_loss, temp_val_loss, temp_step_perplexity = evaluation(model, train_config, eval_dataloader, local_rank, tokenizer, wandb_run)
[rank6]: File "/ssd/llm_chinahpc/Llama3/llama-recipes/src/llama_recipes/utils/train_utils.py", line 372, in evaluation
[rank6]: dist.all_reduce(eval_loss, op=dist.ReduceOp.SUM)
[rank6]: File "/ssd/llm_chinahpc/software/anaconda3_2024.02/envs/llama3-recipes/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank6]: return func(*args, **kwargs)
[rank6]: File "/ssd/llm_chinahpc/software/anaconda3_2024.02/envs/llama3-recipes/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2195, in all_reduce
[rank6]: _check_single_tensor(tensor, "tensor")
[rank6]: File "/ssd/llm_chinahpc/software/anaconda3_2024.02/envs/llama3-recipes/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 863, in _check_single_tensor
[rank6]: raise TypeError(
[rank6]: TypeError: Invalid function argument. Expected parameter `tensor` of type torch.Tensor, but got <class 'float'> instead.
[rank1] and the remaining ranks fail with the same traceback.
Expected behavior
Fine-tuning finishes normally.