ResNet strikes back: And what about fine tuning? #901
-
Congrats to you and colleagues on putting this paper out @rwightman ! Just finished reading it through. Lots of goodies in there and a good point of reference for knowledge gaps I might want to fill re SOTA techniques. My take on the main point of the paper goes something like: in order to get anywhere near a "true" performance benchmark for any given architecture, we should also optimize it jointly with the training procedure, and we show that. Oh and btw, we found some killer recipes while we were at it. I (like many other DL practitioners) pretty much exclusively fine-tune pretrained models though. So as an open question, how transferable are the lessons learned in this study to fine-tuning?
Table 5 shows that after fine-tuning, the gap between the PyTorch ResNet weights and the A-series recipes largely closes. The only one that's a clear win is iNaturalist. And I don't know how much training-recipe optimization PyTorch did, but I'm assuming far less. Does this detract from the impact of the main point of the paper? Or does it imply that the main point of the paper needs to be reapplied in the fine-tuning step in order to get the full benefits (or that we should find out whether that even works as well as we'd like it to)?
-
@alexander-soare a lot of the ideas have some amount of transfer, but one usually (not always) scales back the degree of augmentation, regularization, learning rate, and epochs for transfer learning. I feel fine-tuning is usually easier than training from scratch, but you can certainly see big differences across hparam choices. The transfer settings here were similar to the A3 from-scratch settings, with the LR lower but not drastically so.

The transfer runs were done late in the process when Hugo and Herve had some free cycles on their training infra. It was an important sanity check to do, but not a focus here, so an extensive search wasn't done and no effort was made to target different transfer settings for different source weights (I believe that could be worthwhile if one had the time/resources).

With transfer I often find myself with 'quick' settings, where you dial the augmentation and regularization way back and overfit quickly, or 'long' settings, where you try to stretch it out by keeping augreg high and fine-tuning for more epochs. Which approach works best isn't always clear; the similarity of the two datasets and the size of the transfer set are factors there. Also keep in mind that augmentation is highly dataset + task specific. What worked well for ImageNet classification won't necessarily be at all appropriate for other datasets and tasks.

I don't have a clear mapping of train -> fine-tune. There are some fuzzy rules I'd follow or starting points I'd use based on experience gained from similar situations, but hparam search is the most reliable way to find optimal settings. One can cut down the search space based on past experience and 'gut feel'. For the optimizer, the best from-scratch optimizer isn't necessarily the best fine-tune opt.

For the paper we discussed doing more transfer experiments but decided to keep it minimal due to constraints. I'm interested in exploring that more. I've had thoughts of doing a more in-depth transfer evaluation framework for timm models, but it's quite a bit of work to set up across multiple datasets, models, and weights trained with different hparams, plus hparam search for the transfer settings. Even just finding worthwhile datasets that are closer to what practitioners would see in the field is hard. I don't see much value in transferring to CIFAR, for instance. iNaturalist is good, but it is fairly large so one could also train from scratch reasonably. We need some more 'small but real' datasets.
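To make the 'quick' vs 'long' distinction above concrete, here's a rough sketch of what two such presets might look like with timm + plain PyTorch. The model name, target-dataset size, and every hparam value below are placeholder assumptions for illustration, not settings from the paper or from the transfer runs described here.

```python
# Hypothetical fine-tuning presets illustrating the 'quick' vs 'long' trade-off.
# All names and values are assumptions for illustration only.
import timm
import torch
from timm.data import create_transform

NUM_CLASSES = 100  # assumed size of the hypothetical target dataset

PRESETS = {
    # 'quick': dial augmentation/regularization way back, few epochs,
    # accepting some overfitting risk in exchange for fast convergence.
    "quick": dict(
        epochs=20,
        lr=5e-4,
        weight_decay=1e-4,
        aug=dict(auto_augment=None, color_jitter=0.0, re_prob=0.0, hflip=0.5),
    ),
    # 'long': keep augreg closer to from-scratch settings and stretch the
    # schedule out over more epochs.
    "long": dict(
        epochs=100,
        lr=1e-3,
        weight_decay=0.02,
        aug=dict(auto_augment="rand-m7-mstd0.5", color_jitter=0.4,
                 re_prob=0.25, hflip=0.5),
    ),
}


def build_finetune_setup(preset_name: str = "quick"):
    cfg = PRESETS[preset_name]

    # Pretrained backbone with a fresh classifier head for the target dataset.
    model = timm.create_model("resnet50", pretrained=True, num_classes=NUM_CLASSES)

    # Training-time input pipeline; augmentation strength comes from the preset.
    train_transform = create_transform(input_size=224, is_training=True, **cfg["aug"])

    # A single lower LR for all parameters; per-layer LR decay or freezing
    # early stages are common variations not shown here.
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=cfg["lr"], weight_decay=cfg["weight_decay"]
    )
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=cfg["epochs"])

    return model, train_transform, optimizer, scheduler


if __name__ == "__main__":
    model, train_tf, opt, sched = build_finetune_setup("quick")
    print(train_tf)
```

In practice you'd still want a small hparam search around whichever preset you start from, as noted above; the presets only shrink the search space, they don't replace it.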