Strange "splitted" training #13070

unrue · 2024-06-07T07:54:45Z

I'm running YOLOv5 on 16 GPUs on an HPC machine with 4 GPUs per node:

srun python train.py --epochs 400 --data /home/my_dataset/data/bc.yaml --weights "" --cfg models/hub/yolov5x6.yaml --hyp runs/evolve/exp17/hyp_evolve.yaml --cache --device 0,1,2,3 2>&1 | tee out_training

The training is running on 4 nodes, but it seems to be split into two parts:

Logging results to runs/train/exp147
Starting training for 400 epochs...
Image sizes 640 train, 640 val
Using 16 dataloader workers
Logging results to runs/train/exp146

I see two training folders, exp146 and exp147, where the training is storing results. Why are there two folders? Which one should I use? This is the first time I have seen this behaviour. Thanks.
