wrong batch size after --resume on multiple GPUs #1936
Comments
I'm thinking that we need to modify train.py line 480 (commit b75c432) a bit, changing it to:

opt.cfg, opt.weights, opt.resume, opt.batch_size, opt.global_rank, opt.local_rank = '', ckpt, True, opt.total_batch_size, *apriori

@AmirSa7, could you try this modification and tell me how it goes?

Edit: I tested it quickly and it works. Commands:

python -m torch.distributed.launch --master_port 9963 --nproc_per_node 2 train.py --data coco128.yaml --cfg yolov5s.yaml --weights yolov5s.pt --epochs 3 --batch-size 64 --device 3,4
python -m torch.distributed.launch --master_port 9963 --nproc_per_node 2 train.py --resume
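For context, here is a sketch of the --resume block that the change above touches, paraphrased from train.py around line 480 at commit b75c432 with the proposed line applied. Helper names such as get_latest_run and logger come from the repository; the surrounding code is illustrative rather than a verbatim excerpt.

```python
# Paraphrased sketch of the --resume branch in train.py (~line 480, commit b75c432),
# with the proposed fix applied. Assumes train.py's own imports and helpers
# (os, yaml, argparse, pathlib.Path, get_latest_run, logger).
if opt.resume:  # resume an interrupted run
    ckpt = opt.resume if isinstance(opt.resume, str) else get_latest_run()  # given path or most recent run
    assert os.path.isfile(ckpt), 'ERROR: --resume checkpoint does not exist'
    apriori = opt.global_rank, opt.local_rank
    with open(Path(ckpt).parent.parent / 'opt.yaml') as f:
        opt = argparse.Namespace(**yaml.load(f, Loader=yaml.SafeLoader))  # replace opt with the saved one
    # Reinstate the fields that must not be taken from opt.yaml. Restoring
    # batch_size from total_batch_size is the fix: it keeps the per-GPU value
    # saved in opt.yaml from being divided by world_size a second time.
    opt.cfg, opt.weights, opt.resume, opt.batch_size, opt.global_rank, opt.local_rank = \
        '', ckpt, True, opt.total_batch_size, *apriori
    logger.info('Resuming training from %s' % ckpt)
```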
Yup, it works 👍 Thanks.
@NanoCode012 thanks for the PR! Your fix has been merged now. @AmirSa7, the fix proposed by @NanoCode012 has been merged into master now. Please update to the latest code and try resuming again.
🐛 Bug
After running a training session on multiple GPUs, batch_size is read back incorrectly from opt.yaml on --resume, causing an error.
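For reference, the likely mechanism, sketched from the DDP setup in train.py at the time; variable names follow the repository, but this is a paraphrase, not a verbatim excerpt.

```python
# Paraphrased sketch of how train.py derives the per-GPU batch size in DDP mode.
opt.total_batch_size = opt.batch_size
if opt.local_rank != -1:  # process was launched by torch.distributed.launch
    assert opt.batch_size % opt.world_size == 0, '--batch-size must be multiple of CUDA device count'
    opt.batch_size = opt.total_batch_size // opt.world_size  # each process gets an equal share

# opt (with the already-divided batch_size) is later written to opt.yaml.
# On --resume, opt.yaml is reloaded and the division above runs again, so the
# effective batch size is wrong (or the assert fires) unless the resume code
# restores opt.batch_size from opt.total_batch_size, as in the fix above.
```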
To Reproduce (REQUIRED)
Run this line:
Press Ctrl+C to stop the session.
Now, try to resume:
Output:
Expected behavior
The training should resume correctly, with the right batch size.
Environment
Additional context