
Suggestion to reduce Training time for YoloX-S, YoloX-M & YoloX-L #770

Closed
ajtvsv07 opened this issue Oct 5, 2021 · 9 comments

Comments

@ajtvsv07

ajtvsv07 commented Oct 5, 2021

Hi All,

I am trying to train the YoloX-M model on the COCO dataset, and a single epoch takes around 2 hours 30 minutes with 6 V100 GPUs.
The ETA is shown as approximately 30 days.

Command used to train the model:
python tools/train.py -n yolox-m -d 6 -b 48 --fp16 -o

I tried increasing the data workers and the batch size, but the training time does not drop much; the best I can get is about 2 hours 20 minutes.
I also tried adding --cache to the train command:
python tools/train.py -n yolox-m -d 6 -b 48 --fp16 -o --cache

I still do not see a significant reduction in training time.

Any suggestions on how to reduce it to 10-15 minutes per epoch?

Note:
YOLOv5 training time on the COCO dataset is 10-15 minutes per epoch with 6 V100 GPUs.

Regards,
Arunjeyan TVSV

@ajtvsv07
Author

ajtvsv07 commented Oct 5, 2021

I increased the data workers count and added --cache to the training command.

yolox/exp/yolox_base.py:
self.data_num_workers = 20  # was 4

My training time has come down to 1 hour 30 minutes per epoch.
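For reference, the same change does not require editing yolox/exp/yolox_base.py in place: tools/train.py also accepts a custom experiment file via -f, and the files in exps/default/ follow this pattern. A minimal sketch, assuming the YOLOX-M settings from exps/default/yolox_m.py and only changing data_num_workers (the file name is hypothetical):

# my_yolox_m_exp.py -- hypothetical file name; attribute names mirror yolox_base.py
from yolox.exp import Exp as MyExp

class Exp(MyExp):
    def __init__(self):
        super().__init__()
        self.depth = 0.67            # YOLOX-M scaling, as in exps/default/yolox_m.py
        self.width = 0.75
        self.data_num_workers = 20   # default in yolox_base.py is 4

Training would then be launched with -f pointing at this file instead of -n yolox-m, keeping the other flags unchanged.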

@jackhu-bme

I've run into the same problem. I'm training on a custom dataset (5k medical images, much smaller and easier than COCO), but my compute is far more limited. I wonder if I could increase the learning rate and reduce the number of training epochs (a sketch of this idea follows below) and still reach adequate performance.
Any other suggestions?
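A minimal sketch of that idea in the same custom-Exp style; max_epoch, no_aug_epochs and basic_lr_per_img are attributes defined in yolox_base.py, but the values below are illustrative assumptions for a small dataset, not tested settings:

from yolox.exp import Exp as MyExp

class Exp(MyExp):
    def __init__(self):
        super().__init__()
        self.num_classes = 1                 # set to the custom dataset's class count
        self.max_epoch = 100                 # default is 300
        self.no_aug_epochs = 10              # final epochs without mosaic/mixup (default 15)
        self.basic_lr_per_img = 0.02 / 64.0  # actual lr = basic_lr_per_img * total batch size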

@ajtvsv07
Author

@Joker316701882, can you suggest some techniques to reduce our training time?

@FateScript
Member

What are your data time and train time if you don't change num_workers? @ajtvsv07

@ajtvsv07
Author

ajtvsv07 commented Oct 13, 2021

I am trying to reproduce the results on the COCO dataset.

Train time: 2 hours 30 minutes per epoch.
What do you mean by data time? @FateScript

@FateScript
Member

I am trying to reproduce the results on the COCO dataset.

Train time: 2 hours 30 minutes per epoch. What do you mean by data time? @FateScript

It's in the log info on your terminal, where the loss values are also logged. Could you please check it?

@ajtvsv07
Author

ajtvsv07 commented Oct 13, 2021

@FateScript found it.
data time : 0.063s
iter time : 4.817s

2021-10-04 17:56:09.762 | INFO | yolox.core.trainer:after_iter:238 - epoch: 1/300, iter: 10/2465, mem: 13175Mb, iter_time: 4.817s, data_time: 0.063s, total_loss: 15.4, iou_loss: 4.6, l1_loss: 0.0, conf_loss: 8.5, cls_loss: 2.3, lr: 4.937e-09, size: 640, ETA: 41 days, 5:25:29
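As a rough sanity check, the 41-day ETA in that log line follows directly from the iteration time, the iterations per epoch, and the 300-epoch schedule shown in the log:

# Back-of-the-envelope ETA check based on the log line above.
iter_time = 4.817        # seconds per iteration
iters_per_epoch = 2465   # "iter: 10/2465"
epochs = 300             # "epoch: 1/300"

total_seconds = iter_time * iters_per_epoch * epochs
print(total_seconds / 86400)   # ~41.2 days, matching the logged ETA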

@FateScript
Member

@FateScript found it. data time : 0.063s iter time : 4.817s

2021-10-04 17:56:09.762 | INFO | yolox.core.trainer:after_iter:238 - epoch: 1/300, iter: 10/2465, mem: 13175Mb, iter_time: 4.817s, data_time: 0.063s, total_loss: 15.4, iou_loss: 4.6, l1_loss: 0.0, conf_loss: 8.5, cls_loss: 2.3, lr: 4.937e-09, size: 640, ETA: 41 days, 5:25:29

Your iter time is too long; this might be caused by limited computation power or a wrong environment setting.
A normal training log can be found here. I suspect that you are using multi-GPU training and some devices might take more time; a rough timing sketch follows below.
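One quick way to see whether raw compute alone explains the long iter_time is to time the model forward pass in isolation. A minimal sketch, assuming the YOLOX repo is installed and using a per-GPU batch of 8 (48 images over 6 GPUs) at 640x640; this is only a rough diagnostic, not an official benchmark:

import time
import torch
from yolox.exp import get_exp   # same helper tools/train.py uses to resolve "yolox-m"

exp = get_exp(None, "yolox-m")
model = exp.get_model().cuda().eval()
dummy = torch.randn(8, 3, 640, 640, device="cuda")

with torch.no_grad():
    model(dummy)                 # warm-up
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(10):
        model(dummy)
    torch.cuda.synchronize()
    print((time.time() - start) / 10, "s per forward pass")

If this number is small compared with the logged iter_time, the bottleneck is more likely data loading, cross-GPU synchronization, or CPU contention.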

@ajtvsv07
Author

Thanks @FateScript @Joker316701882 for your logs.
The cause was limited CPU core availability. Now the iter time has dropped to about 1 s and the training time has drastically reduced to