Add InfiniteDataLoader class #876
Conversation
Only initializes at first epoch. Saves time.
@NanoCode012 this is super interesting. The same idea actually crossed my mind yesterday, because when I watch the evolution VMs running, sometimes I run the nvidia-smi command, and occasionally a GPU is at 0% utilization. Evolution epochs do not run test.py after each epoch, nor do they save checkpoints, but there is still a period of several seconds of 0% utilization like you said, to reinitialize the dataloader. I think this might really help smaller datasets train faster. That's amazing you've already implemented it. Are you keeping an if statement at the end of each batch to check when an epoch would have elapsed and then running all the post-epoch functionality then? I'll check out the changes.
Hm, I added results from COCO. I'm not sure if there was some kind of bottleneck when I was training the
If it wasn't a bottleneck, this could be a problem. I've looked at pytorch-lightning and rwightman's pytorch repo, and they used very similar code to the current one though.
I did a re-run, and there actually was some bottleneck. The new results are lower than their non-infinite counterparts.
I tested out the Infinite dataloader on VOC training. The epoch transition did seem a few seconds faster to me; final time was roughly the same, about 5 hours for both. mAP seemed fine with the infinite loader too, did not see any problems there. That's really strange that you saw some slower times with infinite. What do the double slashes // mean? With regards to the COCO finetune results, this is uncharted territory for me. The finetuning hyps are evolved for 50 epochs of VOC, which I hope will be closer to the typical custom dataset than full COCO from scratch. That's really interesting that you achieved similar results for 20 and 100 finetuning epochs as from 300 from scratch. That's actually awesome news, because then perhaps I can run an evolution campaign for COCO finetuning using only 20-epoch results rather than having to evolve hyps on 300 epochs from scratch (which is impractical really for anyone without free cloud access, looking at you Google Brain team).
What do you mean by lower? Slower?
Hm. I think this PR would be geared toward those training on custom datasets with short epoch times but a large number of epochs. I read somewhere that it can save 1 hr of training for COCO 300 epochs on 8x V100. If we assume 10 seconds of time saved per epoch, that'll be 10 × 300 = 3000 s ≈ 50 minutes per GPU. That would save some money.
The slashes separate my first run and second run with Infinite. The first run probably had some bottleneck with my
Sorry, by lower, I meant lower training time = faster training speed. I will add a graph to visualize later on. For example, COCO 2017
Oh yes! You are right. However, the fine-tuned versions were not able to reach the same heights as the ones trained from scratch, so it should only be used as a guide. I'm thinking of setting aside one or two GPUs to test this theory for a week. Could you give me the evolve commands and the starting hyp? (Should we use the InfiniteDataloader branch?) 20 epochs would take around 10*20 = 200 minutes ≈ 3.3 hours per generation for
@NanoCode012 ah ok, so infinite provides faster speed in all cases, and does not appear to harm mAP. It seems to be good news all around. Is this ready to merge then?
Ready!
Would you like me to try to evolve
Regarding the evolution, the results I'm seeing on VOC are exciting, but I'm not sure if they are repeatable on COCO. Here is a basic summary of what I've been doing:
```bash
# Hyperparameter evolution commands
while true; do
  python train.py --batch 64 --weights yolov5m.pt --data voc.yaml --img 512 --epochs 50 --evolve --bucket ult/voc --device $1
done
```
I'll raise a new issue with more details and complete code/commands for COCO finetuning evolution later today.
@NanoCode012 ok, I'm thinking of how to run COCO finetuning evolution. This is going to be pretty slow unfortunately, but that's just how it is. I'm trying to figure out if I can get a 9-image mosaic update in before doing this. My tests don't seem to show a huge effect from switching from 4 to 9 image mosaic unfortunately. I should have more solid results in about a day. In any case, I've created a tentative yolov5:evolve docker image for evolving COCO https://hub.docker.com/r/ultralytics/yolov5/tags. The most efficient manner of evolving (i.e. FLOPS/day) is to evolve using separate containers, one per GPU. So for example for a 4-GPU machine (as shown in evolve.sh), this runs 4 containers, assigning a different GPU to each container, but with all containers pulling data from the local volume -v.

```bash
# Start on 4-GPU machine
for i in 0 1 2 3; do
  t=ultralytics/yolov5:evolve && sudo docker pull $t && sudo docker run -d --ipc=host --gpus all -v "$(pwd)"/VOC:/usr/src/VOC $t bash utils/evolve.sh $i
  sleep 60 # avoid simultaneous evolve.txt read/write
done
```

I usually assign a special GCP bucket to receive evolution results, and then all containers (no matter which machine or GPU they are on) read and write from that same common source. This works well most of the time, but occasionally I run into simultaneous read/write problems with gsutil, the GCP command line utility that handles all the traffic in and out of the bucket. I'm going to try to deploy a fix for this, and should have everything all set in the next couple of days to begin COCO finetuning evolution. The command I had in mind is this (which would go in evolve.sh), and the hyp.finetune.yaml file would be updated to the latest VOC results, which are below. On a T4, which I'm using, each epoch takes about 1 hour, so we'd get about 4 generations done every 5 days. If I can deploy 8 T4's, this would be about 45 generations per week. If you can pitch in GPU hours also (and of course anyone else reading this that would like to contribute), then we can both evolve to/from the same bucket, and maybe get this done faster!

```bash
# Hyperparameter evolution commands
while true; do
  # python train.py --batch 64 --weights yolov5m.pt --data voc.yaml --img 512 --epochs 50 --evolve --bucket ult/voc --device $1
  python train.py --batch 40 --weights yolov5m.pt --data coco.yaml --img 640 --epochs 30 --evolve --bucket ult/voc --device $1
done
```

```yaml
# Hyperparameters for VOC finetuning
# python train.py --batch 64 --weights yolov5m.pt --data voc.yaml --img 512 --epochs 50
# See tutorials for hyperparameter evolution https://docs.ultralytics.com/yolov5
# Hyperparameter Evolution Results
# Generations: 249
# P R mAP.5 mAP.5:.95 box obj cls
# Metrics: 0.6 0.936 0.896 0.684 0.0115 0.00805 0.00146
lr0: 0.0032
lrf: 0.12
momentum: 0.843
weight_decay: 0.00036
giou: 0.0296
cls: 0.243
cls_pw: 0.631
obj: 0.301
obj_pw: 0.911
iou_t: 0.2
anchor_t: 2.91
anchors: 3.63
fl_gamma: 0.0
hsv_h: 0.0138
hsv_s: 0.664
hsv_v: 0.464
degrees: 0.373
translate: 0.245
scale: 0.898
shear: 0.602
perspective: 0.0
flipud: 0.00856
fliplr: 0.5
mixup: 0.243
```
Hi @glenn-jocher. I saw that you've done 250+ generations already. Cool! (I think we should create a new Issue for this, so there is more visibility and potentially more helpers.)
I built a simple docker off your
I checked that I can
Fine-tuning cannot reach the same mAP as training from scratch.
5m finetune
Command:

```bash
python train.py --data coco.yaml --cfg yolov5m.yaml --weights yolov5m.pt --batch-size $bs --epoch $e
```

Base mAP is
From the above, it is safe to say that batch size 64 with 40 epochs produces the "best" results. I'm not sure whether I should re-run it again to confirm this.
A small test has been done for
I will use one or two V100s for this. We can see how it turns out after a week or two.
Is this time for COCO or VOC? If this is COCO, this would be amazing, because I could only do 11 generations a week with a single V100, whereas 4 T4s could easily do 20? Your setup is really efficient!
I will wait for the fix then! Meanwhile, I will set a few runs for a better comparison for the
Edit: I just saw that you use batch-size 40 for 5m. I didn't realize you changed it. Will set a finetune test for this.
@NanoCode012 agree, will open a new issue on this soon. --batch 40 is the max possible for single 15GB T4 YOLOv5m GCP training with the docker image, or --batch 48 with a single 16GB V100. I guess you must have used multi-GPU to reach 64. It's possible your finetuning mAPs are higher than you think: test.py runs a slightly more comprehensive (but slightly slower) mAP solution when called directly, i.e.
By the way, I just tested a custom dataset and was surprised to see that it finetunes much better with hyp.scratch.yaml than hyp.finetune.yaml. I'm pretty confused by this result. It's possible evolution results on one dataset may not correlate well to a second dataset unfortunately. I'll have to think about it.
@NanoCode012 ah sorry, to answer your other question, the times are for COCO. COCO trains about 10X slower than VOC. VOC can train much faster because it has fewer images (16k vs 118k), which are natively smaller and which I train smaller (512 vs 640), and it can --cache for faster dataloading due to the smaller size. A VOC 512 epoch takes 5 min on a T4 or 2 min on a V100, vs 60 min or 20 min for COCO 640.
Hi @glenn-jocher,
I tried to do the below but got the reverse.

```bash
# From ReadMe.md
python test.py --data coco.yaml --img 640 --conf 0.001 --weights ...
```

As an example of my bs64 of 40 epochs coco2017,
On coco128 google colab,
Hmm, I was actually thinking if there could be one hyp file for each dataset/goal (
The hard part would be usability (need to explain to users, tutorials) and maintenance. It's hard for a one-size-fits-all solution.
Okay!
Oh, that's really strange. I've not seen that before. I was just running through the VOC results. I checked the difference here between using the final last.pt mAP from training and running test.py afterwards using last.pt. Most improved, with the greatest improvement in mAP@0.5:0.95.
Best VOC mAP is 92.2!
```
!python test.py --data voc.yaml --weights '../drive/My Drive/cloud/runs/voc/exp3_yolov5x/weights/last.pt' --img 640 --iou 0.50 --augment

Namespace(augment=True, batch_size=32, conf_thres=0.001, data='./data/voc.yaml', device='', img_size=640, iou_thres=0.5, merge=False, save_json=False, save_txt=False, single_cls=False, task='val', verbose=False, weights=['../drive/My Drive/cloud/runs/voc/exp3_yolov5x/weights/last.pt'])
Using CUDA device0 _CudaDeviceProperties(name='Tesla P100-PCIE-16GB', total_memory=16280MB)

Fusing layers...
Model Summary: 284 layers, 8.85745e+07 parameters, 8.45317e+07 gradients
Scanning labels ../VOC/labels/val.cache (4952 found, 0 missing, 0 empty, 0 duplicate, for 4952 images): 4952it [00:00, 18511.94it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100% 155/155 [04:38<00:00,  1.80s/it]
                 all    4.95e+03     1.2e+04       0.587       0.963       0.922       0.743
Speed: 53.0/1.3/54.2 ms inference/NMS/total per 640x640 image at batch-size 32
```
Hi @glenn-jocher, congrats! How much did it go up by changing only the hyps? Do the hyps affect the models differently, since you used the 5m model to train?
Adding this changed quite a lot! From earlier comment, bs64 e40.
Here's an interesting effect of finetune vs scratch on the 5m (still training). Since mosaic9 did not work out, are you planning to add some more changes (like with the new file
@NanoCode012 haha, yes that's what I'm worried about. I thought I was evolving some finetuning hyps that would be good for the whole world, but now I'm thinking maybe they're just mainly good for VOC. The final hyps are drastically different from the scratch hyps I started from. lr0, for example, drops from 0.01 to 0.003, and momentum dropped from 0.937 to 0.843. This produced about a +3 mAP increase on VOC. The good news is that all models s/m/l/x appeared to benefit equally from evolving on YOLOv5m, so that's great news on its own. That means that 5m results at least can be counted on to correlate well with the entire range. I'm going to finetune the 4 from the initial hyps (just once for 50 epochs) also to do a better before-and-after comparison, because right now I'm just relying on my memory.
I was looking at sotabench yesterday and decided to try it out, as their existing results seem quite slow to me. It's possible we exceed all of the existing models greatly in terms of mAP at a given speed level. But I found a few bugs in their examples, submitted a PR to their repo, and altogether found support very limited there (the forum has 10 posts over 2 years), which is unfortunate because the idea seems great.
Mosaic9 didn't fail, it just didn't provide markedly different results than mosaic4 in the VOC tests I ran. I think a key factor in the mosaic is cropping at the image edges, but this is for further exploration. So I suppose that yes, I just need to fix the gsutil overwrite bug and then we can start finetuning COCO. I see your plot there, that's super important, as I'm not sure which hyps to start evolving from. Looks like blue is going to win, but let's wait.
Hi @glenn-jocher , contrary to our expectations, scratch won! Finetune is from gen 306 of VOC.
This really made me think whether all my past results could be improved if I had used the scratch hyp, or whether the current hyps are overfitting to VOC. I see that you've made a bug fix for gsutil. I'm thinking of the possibility that two nodes read and upload at the same time, cancelling each other out. I was thinking of using a mutex-lock-style approach.
This is to prevent writing at the same time. Theoretically, there would not be a high chance that we both read at the same time, but if others were to join in, the chances would increase. This would be a sure-fire way of blocking, albeit expensive, and I don't think this is the norm for blocking in Python.
Edit: Added table.
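For illustration, here is a rough sketch of the mutex-lock idea using a small lock object in the bucket; the bucket path, lock name, and polling interval are assumptions for the example (nothing here is from the repo), and the check-then-create step is only best-effort, since GCS offers no truly atomic lock this way:

```python
# Hypothetical lock-file guard around shared evolve.txt reads/writes.
# Assumes gsutil is installed and authenticated; BUCKET/LOCK names are placeholders.
import subprocess
import time
import uuid

BUCKET = 'gs://my-evolve-bucket'  # placeholder, not a real bucket
LOCK = f'{BUCKET}/evolve.lock'


def acquire_lock(timeout=600, poll=5):
    token = str(uuid.uuid4())
    deadline = time.time() + timeout
    while time.time() < deadline:
        # `gsutil stat` exits non-zero when the object does not exist
        if subprocess.run(['gsutil', '-q', 'stat', LOCK]).returncode != 0:
            # upload a small token object to claim the lock (best-effort, not atomic)
            subprocess.run(f'echo {token} | gsutil -q cp - {LOCK}', shell=True, check=True)
            return token
        time.sleep(poll)
    raise TimeoutError('could not acquire evolve.txt lock')


def release_lock():
    subprocess.run(['gsutil', '-q', 'rm', LOCK], check=True)

# usage: acquire_lock(); pull evolve.txt, append the new generation, push it back; release_lock()
```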
@NanoCode012 wow! Yup, well that's interesting. It's likely the hyperparameter space has many local minima, and hyp.scratch.yaml clearly appears to be a better local minimum for COCO than hyp.finetune.yaml, so we should start the hyp evolution there. It's unfortunate that the VOC finetuning results do not correlate well with COCO. Yes, I made a fix! When evolving to a local evolve.txt it is almost impossible to read/write at the same time, as the speeds are near instantaneous (and the file is small), so there are no issues evolving locally. But when evolving from different VMs/nodes to a cloud evolve.txt, gsutil can take several seconds to make the connection and read/write, which sometimes causes a corrupt file if another VM is doing the same at the same time, and gsutil then deletes the file when it detects corruption, losing all results (!). The new fix should avoid this by ignoring corrupted/empty files, so only a single generation from a single node would be lost rather than the entire file. Ok, so we know to start from hyp.scratch.yaml, we know to use YOLOv5m; now all that's left is to decide a number of epochs and start. I see you used 40 there with good results.
@NanoCode012 I just finished a set of YOLOv5m 20-epoch finetunings for each hyp file. I get the same results, scratch is better. We can essentially start evolution now, but another idea came to me. I'm thinking the dip in results on epoch 1 may be due to the warmup settings. The warmup slowly ramps up lr and momentum during the first 3 epochs; it is mainly intended for training from scratch to help stability. The initial values for lr and momentum are 0.0 and 0.9 generally, but there is a param group 2 that starts with a very aggressive lr of 0.10 and actually ramps this down to lr0. When training from scratch this helps adjust output biases especially, and works well because bias params are few in number and are not naturally unstable the way weights can be. I'm thinking this might be related to the initial drop on finetuning. The effect on final mAP is likely limited, but I'm exploring whether to adjust the warmup when
EDIT: Another option is to make these warmup settings evolve-able, but we already have 23 hyps, so I'd rather not grow the group if I don't have to.
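To make the warmup shape described above concrete, here is a hedged sketch of a linear warmup that ramps most groups' lr up from 0.0, ramps the bias group's lr down from 0.1, and ramps momentum toward its nominal value; the function and variable names are illustrative, not lifted from train.py:

```python
import numpy as np


def warmup_step(optimizer, ni, nw, lr_target, momentum_target,
                warm_momentum=0.9, bias_group=2):
    """One linear warmup step.

    ni: integrated batches seen so far, nw: total warmup batches,
    lr_target: the scheduled lr for the current epoch.
    """
    if ni > nw:
        return  # warmup finished, leave the scheduler in charge
    xi = [0, nw]
    for j, g in enumerate(optimizer.param_groups):
        # the bias group starts at an aggressive 0.1 and ramps DOWN to the scheduled lr;
        # all other groups ramp UP from 0.0
        start_lr = 0.1 if j == bias_group else 0.0
        g['lr'] = float(np.interp(ni, xi, [start_lr, lr_target]))
        if 'momentum' in g:
            g['momentum'] = float(np.interp(ni, xi, [warm_momentum, momentum_target]))
```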
@NanoCode012 ok, I am testing a handful of warmup strategies for finetuning with hyp.scratch.yaml now (including no warmup). I should have results by tomorrow. |
Thanks for the explanation.
Yep. Batchsize test from
Time is only for reference. There can be bottlenecks, as I train multiple runs at a time. Would using multiple containers be faster than a single container? I've only done multiple trainings in a single container at a time, so I can keep track of which commit version I'm on and consolidate the results on Tensorboard. Yesterday, I also set tests for different epochs besides 40, such as 20, 30, 50, 60, 70, and 100. The last three are at around epoch 55, so we will see the results by tomorrow as well. This should give us a good grasp on which to choose. We should balance the number of epochs against accuracy. I suspect that with more epochs, accuracy becomes only marginally better.
Looking forward to these results!
Hi @glenn-jocher, my tests on different epochs are now done.
Overview:
Closer look near peak:
An extra 10 epochs is around 3-4 hours. An epoch is around 19-20 mins. Results at highest:
We should safely be able to use 40 epochs at the highest batch-size (64 for me) unless your warmup results prove otherwise. My next concern is how far we are going to tune the hyps (to not overfit), and whether these will have a direct correlation to training from scratch, as we do not have any conclusive evidence it would happen, only that it reaches near the same value. For example, pycocotest for finetune_100 reaches 63.32.
Do you also see
@NanoCode012 wow, that's really good work! I really need to switch to tensorboard for my cloud training, I'm still stuck plotting results.txt files. I tested 12 different warmup techniques (YOLOv5m --epochs 10), and was surprised to see minimal impact overall. One interesting takeaway is that results320 (no warmup) shows by far the best GIoU loss (need to rename this to box loss, CIoU is used now), which seems to show that box loss always benefits from more training to a greater degree than the other two. If I zoom in on the final 5 epochs, I see that 322 was the best, which starts from a very low initial momentum of 0.5 for all param groups. But 324 (pink, initial momentum = 0.8, initial bias lr0 = 0.01) showed the best trend in the final epochs. So I think I'll create a docker image with a mix of the best two results above (initial momentum 0.6, initial bias lr 0.05), and also make them evolveable. From your results we see diminishing returns from increased epochs, with 20 to 30 showing the largest improvement vs added time. So it looks like 30 may be a good sweet spot.
@NanoCode012 BTW, about your other question, how finetune hyps will relate to scratch hyps, I really don't know. Evolving finetuning hyps on COCO is probably our most feasible task. We could evolve from-scratch hyps similarly, perhaps for more epochs, i.e. 50 or 100, but these would take much longer to test, since we'd want to apply them to a full 300 epochs, and it's possible that scratch hyps evolved for even 100 epochs would overfit far too soon when trained to 300 epochs, so we'd be back to playing the game of preventing early overfitting on our results. This is basically what I did with YOLOv3 last year; it's a very difficult path for those with limited resources. If we were Google we could use a few 8x V100 machines to evolve scratch hyps over 300 full epochs, problem solved, but that door isn't open to us unfortunately, so evolving finetuning for 30 epochs seems like our best bet. To be clear though, I don't know how well it will work; it's a big experiment. In ML often you just have to do it and find out.
@NanoCode012 hyp evolution is all set now! See #918
* Add InfiniteDataLoader: Only initializes at first epoch. Saves time.
* Moved class to a better location
Feature
Dataloader takes a chunk of time at the start of every epoch to start worker processes. We only need to initialize it once, at epoch 1, through this InfiniteDataLoader class, which subclasses the DataLoader class. Other repos implement this as well; that is where I got the idea from.
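For readers skimming the thread, here is a minimal sketch of the pattern described above (a DataLoader subclass whose batch sampler repeats forever, so worker processes are created once and reused); it illustrates the idea and may differ in detail from the merged code:

```python
import torch


class _RepeatSampler:
    """Wraps a (batch) sampler and repeats it forever."""

    def __init__(self, sampler):
        self.sampler = sampler

    def __iter__(self):
        while True:
            yield from iter(self.sampler)


class InfiniteDataLoader(torch.utils.data.DataLoader):
    """DataLoader that reuses its worker processes across epochs."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # DataLoader forbids reassigning batch_sampler after __init__, so bypass __setattr__
        object.__setattr__(self, 'batch_sampler', _RepeatSampler(self.batch_sampler))
        self.iterator = super().__iter__()  # workers start here, once

    def __len__(self):
        # the wrapped object is the original batch sampler, i.e. batches per epoch
        return len(self.batch_sampler.sampler)

    def __iter__(self):
        # yield exactly one epoch's worth of batches, then hand control back to the caller
        for _ in range(len(self)):
            yield next(self.iterator)
```

With this shape the usual nested loop (for epoch in range(epochs): for batch in dataloader: ...) keeps working unchanged, and post-epoch logic such as testing and checkpointing runs where it always did, while the workers persist between epochs.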
Resources:
Results
Theoretically, this would reduce training time on small datasets more than large ones, as their epoch times are shorter. The training time gap should also increase as we increase the number of epochs.
I would like to test this fully before this PR is ready, especially in DDP mode as the resources said conflicting things on the DDP sampler.
If you have a certain dataset you would like me to test speed on, please feel free to add a comment.
master: 08e97a2
Env: conda py37
Model: 5s
Batch-size / Total Batch-size: 64
GPU: V100
Epochs: 10
Note:
coco128: Maybe due to the finetuning hyp on lr, the mAPs stay around 68 even after 10 epochs, which was a bit unusual.
coco2017: mAP seems consistent (see comment below for graphs)
Commands to test
Prospectives:
multiprocessing.spawn causes dataloaders to initialize workers at the start of every epoch (or batch) in a slow way, and it slows down speed significantly. That is why we are still using launch. Maybe this PR could open the way to using mp.spawn. This way, we can eliminate DP and just use --device to specify single-GPU or DDP.
🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
Introduction of an Infinite DataLoader for improved data loading in the YOLOv5 training pipeline.
📊 Key Changes
* Replaced torch.utils.data.DataLoader with a new custom InfiniteDataLoader.
* Added InfiniteDataLoader and _RepeatSampler classes.
* InfiniteDataLoader reuses workers to constantly feed data.
* _RepeatSampler is designed to repeat the sampling process indefinitely.
🎯 Purpose & Impact