ResNet strikes back: And what about fine tuning? #901
-
Congrats to you and colleagues on putting this paper out @rwightman ! Just finished reading it through. Lots of goodies in there and a good point of reference for knowledge gaps I might want to fill re SOTA techniques. My take on the main point of the paper goes something like: in order to get anywhere near a "true" performance benchmark for any given architecture, we should also optimize it jointly with the training procedure, and we show that. Oh and btw, we found some killer recipes while we were at it. I (like many other DL practitioners) pretty much exclusively fine-tune pretrained models though. So as an open question, how transferable are the lessons learned in this study to fine-tuning?
Table 5 shows that after fine-tuning, the gap between the PyTorch ResNet weights and the A-series recipes largely closes. The only one that's a clear win is iNaturalist. And I don't know how much training-recipe optimization PyTorch did, but I'm assuming far less. Does this detract from the impact of the main point of the paper? Or does it imply that the main point of the paper needs to be reapplied in the fine-tuning step in order to get the full benefits (or that we should find out whether that even works as well as we'd like it to)?
-
@alexander-soare a lot of the ideas have some amount of transfer, but one usually (not always) scales back the degree of augmentation, regularization, learning rate, and epochs for transfer learning. I feel fine-tuning is usually easier than training from scratch, but you can certainly see big differences across hparam choices. The transfer settings here were similar to the A3 from-scratch settings, with the LR lower but not drastically so.

The transfer runs were done late in the process when Hugo and Herve had some free cycles on their training infra. It was an important sanity check to do, but not a focus here, so an extensive search wasn't done and no effort was made to target different transfer settings for different source weights (I believe that could be worthwhile if one had the time/resources).

With transfer I often find myself with 'quick' settings, where you dial the augmentation and regularization way back and overfit quickly, or 'long' settings, where you try to stretch it out by keeping augreg high and fine-tuning for more epochs. Which approach works best isn't always clear; the similarity of the two datasets and the size of the transfer set are factors there. Also keep in mind that augmentation is highly dataset + task specific. What worked well for ImageNet classification won't necessarily be at all appropriate for other datasets and tasks.

I don't have a clear mapping of train -> fine-tune. There are some fuzzy rules I'd follow or starting points I'd use based on experience gained from similar situations, but hparam search is the most reliable way to find optimal settings. One can cut down the search space based on past experience and 'gut feel'. For the optimizer, the best from-scratch optimizer isn't necessarily the best fine-tune opt.

For the paper we discussed doing more transfer experiments but decided to keep it minimal due to constraints. I'm interested in exploring that more. I've had thoughts of doing a more in-depth transfer evaluation framework for timm models, but it's quite a bit of work to set up across multiple datasets, models, and weights trained with different hparams, plus hparam search for the transfer settings. Even just finding worthwhile datasets that are closer to what practitioners would see in the field is hard. I don't see much value in transferring to CIFAR, for instance. iNaturalist is good, but it is fairly large so one could also train from scratch reasonably. We need some more 'small but real' datasets.
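To make the 'quick' vs 'long' distinction above concrete, here's a rough sketch of what two such presets might look like with timm + plain PyTorch. The model name, target-dataset size, and every hparam value below are placeholder assumptions for illustration, not settings from the paper or from the transfer runs described here.

```python
# Hypothetical fine-tuning presets illustrating the 'quick' vs 'long' trade-off.
# All names and values are assumptions for illustration only.
import timm
import torch
from timm.data import create_transform

NUM_CLASSES = 100  # assumed size of the hypothetical target dataset

PRESETS = {
    # 'quick': dial augmentation/regularization way back, few epochs,
    # accepting some overfitting risk in exchange for fast convergence.
    "quick": dict(
        epochs=20,
        lr=5e-4,
        weight_decay=1e-4,
        aug=dict(auto_augment=None, color_jitter=0.0, re_prob=0.0, hflip=0.5),
    ),
    # 'long': keep augreg closer to from-scratch settings and stretch the
    # schedule out over more epochs.
    "long": dict(
        epochs=100,
        lr=1e-3,
        weight_decay=0.02,
        aug=dict(auto_augment="rand-m7-mstd0.5", color_jitter=0.4,
                 re_prob=0.25, hflip=0.5),
    ),
}


def build_finetune_setup(preset_name: str = "quick"):
    cfg = PRESETS[preset_name]

    # Pretrained backbone with a fresh classifier head for the target dataset.
    model = timm.create_model("resnet50", pretrained=True, num_classes=NUM_CLASSES)

    # Training-time input pipeline; augmentation strength comes from the preset.
    train_transform = create_transform(input_size=224, is_training=True, **cfg["aug"])

    # A single lower LR for all parameters; per-layer LR decay or freezing
    # early stages are common variations not shown here.
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=cfg["lr"], weight_decay=cfg["weight_decay"]
    )
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=cfg["epochs"])

    return model, train_transform, optimizer, scheduler


if __name__ == "__main__":
    model, train_tf, opt, sched = build_finetune_setup("quick")
    print(train_tf)
```

In practice you'd still want a small hparam search around whichever preset you start from, as noted above; the presets only shrink the search space, they don't replace it.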