
wrong batch size after --resume on multiple GPUs #1936

Closed
AmirSa7 opened this issue Jan 14, 2021 · 3 comments · Fixed by #1942
Labels
bug Something isn't working

Comments

AmirSa7 commented Jan 14, 2021

🐛 Bug

After a multi-GPU training run is interrupted and resumed, batch_size is read back incorrectly from opt.yaml, causing an error.

To Reproduce (REQUIRED)

Run this line:

python -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size 30 --data coco128.yaml --weights yolov5s.pt --device 1,2
  • Note that we pass --batch-size 30, i.e. 15 per GPU.

Press Ctrl+C to stop the session.
Now, try to resume:

python -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size 30 --data coco128.yaml --weights yolov5s.pt --device 1,2 --resume

Output:

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
github: Traceback (most recent call last):
  File "train.py", line 492, in <module>
    device = select_device(opt.device, batch_size=opt.batch_size)
  File "/user/detection/tests/yolov5/utils/torch_utils.py", line 68, in select_device
    assert batch_size % n == 0, f'batch-size {batch_size} not multiple of GPU count {n}'
AssertionError: batch-size 15 not multiple of GPU count 2
up to date with https://github.com/ultralytics/yolov5 ✅
Resuming training from ./runs/train/exp/weights/last.pt
Traceback (most recent call last):
  File "train.py", line 492, in <module>
    device = select_device(opt.device, batch_size=opt.batch_size)
  File "/user/detection/tests/yolov5/utils/torch_utils.py", line 68, in select_device
    assert batch_size % n == 0, f'batch-size {batch_size} not multiple of GPU count {n}'
AssertionError: batch-size 15 not multiple of GPU count 2
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/user/venv/yolov5/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/user/venv/yolov5/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/user/venv/yolov5/bin/python', '-u', 'train.py', '--local_rank=1', '--batch-size', '30', '--data', 'coco128.yaml', '--weights', 'yolov5s.pt', '--device', '1,2', '--resume']' returned non-zero exit status 1.
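
For context, here is a minimal sketch of the failing arithmetic as I understand it, assuming opt.yaml stores the per-GPU batch size computed during the original run (the variable names just mirror the traceback above):

world_size = 2                                        # --nproc_per_node 2
total_batch_size = 30                                 # --batch-size 30
per_gpu_batch_size = total_batch_size // world_size   # 15, written into opt.yaml

# On --resume, opt is reloaded from opt.yaml, so select_device() sees 15:
n = world_size
batch_size = per_gpu_batch_size
assert batch_size % n == 0, f'batch-size {batch_size} not multiple of GPU count {n}'
# AssertionError: batch-size 15 not multiple of GPU count 2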

Expected behavior

The training should resume correctly, with the right batch size.

Environment

  • OS: Ubuntu 16.04
  • GPU: GTX 1080 Ti (x2)

Additional context

--

AmirSa7 added the bug label on Jan 14, 2021
NanoCode012 commented Jan 14, 2021

I'm thinking we need to modify the line below a bit.

yolov5/train.py, line 480 (at b75c432):

opt.cfg, opt.weights, opt.resume, opt.global_rank, opt.local_rank = '', ckpt, True, *apriori  # reinstate

to

opt.cfg, opt.weights, opt.resume, opt.batch_size, opt.global_rank, opt.local_rank = '', ckpt, True, opt.total_batch_size, *apriori
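
For context, the surrounding --resume block in train.py looks roughly like the sketch below (approximate, not verbatim; opt and get_latest_run come from train.py). The key point is that opt is replaced wholesale by the saved opt.yaml, where batch_size is the per-GPU value, so the total has to be reinstated before it gets divided by world_size again:

import argparse
from pathlib import Path
import yaml

if opt.resume:
    ckpt = opt.resume if isinstance(opt.resume, str) else get_latest_run()  # checkpoint to resume from
    apriori = opt.global_rank, opt.local_rank                               # DDP ranks to carry over
    with open(Path(ckpt).parent.parent / 'opt.yaml') as f:
        opt = argparse.Namespace(**yaml.safe_load(f))  # replace opt with the saved run's options
    # batch_size in opt.yaml is the per-GPU size (total // world_size),
    # so reinstate the total here:
    opt.cfg, opt.weights, opt.resume, opt.batch_size, opt.global_rank, opt.local_rank = \
        '', ckpt, True, opt.total_batch_size, *apriori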

@AmirSa7, could you try to modify it and tell me how it goes?

Edit: I tested it quickly and it works.

Commands:

python -m torch.distributed.launch --master_port 9963 --nproc_per_node 2 train.py --data coco128.yaml --cfg yolov5s.yaml --weights yolov5s.pt --epochs 3 --batch-size 64 --device 3,4
python -m torch.distributed.launch --master_port 9963 --nproc_per_node 2 train.py --resume

AmirSa7 commented Jan 14, 2021


Yup, it works 👍 Thanks.

glenn-jocher commented Jan 14, 2021

@NanoCode012 thanks for the PR! Your fix has been merged now.

@AmirSa7 the fix proposed by @NanoCode012 has been merged into master now. Please git pull to receive this update and let us know if you encounter any other issues!
