
Add GoogLeNet (Inception v1) #678

Merged: 14 commits merged from TheCodez:inception into pytorch:master on Mar 7, 2019

Conversation

@TheCodez (Contributor) commented Dec 8, 2018

This adds the GoogLeNet (Inception v1) model, including the auxiliary classifiers found in the paper. Related to issue #537.

I have updated the example project on my branch to add support for training the GoogLeNet. Sadly I don't have the computing resources available to train the model on ImageNet.
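
For context, a minimal sketch of how the paper's auxiliary classifiers are typically used during training: each auxiliary loss is added with a weight of 0.3 per the paper, and the auxiliary classifiers are discarded at inference time. The tuple return order below is an assumption for illustration, not confirmed from this implementation.

import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def googlenet_loss(outputs, target):
    # outputs = (main_logits, aux2_logits, aux1_logits) during training
    # (return order assumed); the paper weights each auxiliary loss by 0.3
    output, aux2, aux1 = outputs
    return (criterion(output, target)
            + 0.3 * criterion(aux2, target)
            + 0.3 * criterion(aux1, target))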

@TheCodez (Contributor, Author):

@fmassa what do you think?

@fmassa (Member) left a comment:

The implementation looks great, thanks!

But we need some pre-trained weights before this can be merged.

I can launch some training runs to see if I manage to obtain the reported results.
Do you know what hyperparameters to use to launch training?

From my understanding, though, reproducing the GoogLeNet results is known to be quite hard.

@TheCodez (Contributor, Author):

Those are the parameters used by the Berkeley Vision GoogLeNet.
My branch also includes the learning rate decay described in the paper.
https://github.com/TheCodez/examples/blob/inception/imagenet/README.md#googlenet

python main.py -a googlenet --lr 0.01 --wd 0.0002 [imagenet-folder with train and val folders]

@fmassa (Member) commented Dec 21, 2018

Awesome, thanks!

I'll try running the imagenet examples with your model and I'll report back once training is over.

@TheCodez (Contributor, Author):

@fmassa I just noticed that I missed a padding in maxpool3, which resulted in an output size of 13x13x480 instead of the correct 14x14x480.
Is there any difference in using ceil_mode=True vs. padding=1 in the max pools?

@TheCodez (Contributor, Author) commented Jan 5, 2019

I have also added a batch normalized GoogLeNet version which could be trained using:

python main.py -a googlenet_bn --lr 0.045 --wd 0.0002 [imagenet-folder with train and val folders]

I'm not entirely sure which learning rate decay strategy would be best suited: maybe the default one from the imagenet script, or the one from the BatchNorm paper that decays six times faster than GoogLeNet's.

Also, for the original GoogLeNet, the poly decay policy used by the Berkeley Vision GoogLeNet might be better suited than the one found in the paper.

@fmassa (Member) commented Jan 8, 2019

Hi,

I just got back from holidays. The training that I launched didn't learn anything.

Is there any difference in using ceil_mode=True vs. padding=1 in the max pools?

There is a difference between ceil_mode=True and padding=1, and they are complementary: ceil_mode controls the rounding used when computing the output size (i.e., ceil(input / stride) instead of floor, in the simplest case), while padding always adds the specified amount of padding to the input.
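
To illustrate the difference, a minimal sketch (the 28x28 input is what maxpool3 sees in GoogLeNet):

import torch
import torch.nn as nn

x = torch.randn(1, 480, 28, 28)  # input to maxpool3 in GoogLeNet

pool_floor = nn.MaxPool2d(3, stride=2)                 # floor((28 - 3) / 2) + 1 = 13
pool_ceil = nn.MaxPool2d(3, stride=2, ceil_mode=True)  # ceil((28 - 3) / 2) + 1 = 14
pool_pad = nn.MaxPool2d(3, stride=2, padding=1)        # floor((28 + 2 - 3) / 2) + 1 = 14

print(pool_floor(x).shape)  # torch.Size([1, 480, 13, 13])
print(pool_ceil(x).shape)   # torch.Size([1, 480, 14, 14])
print(pool_pad(x).shape)    # torch.Size([1, 480, 14, 14])

Note that ceil_mode=True and padding=1 produce the same 14x14 shape here, but the pooling windows sit at different positions, so values near the borders can differ.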

I can try running more training jobs, if you have the hyper-parameters to try.

@TheCodez (Contributor, Author) commented Jan 8, 2019

@fmassa the batch normalized version should be easier to train:

python main.py -a googlenet_bn --lr 0.045 --wd 0.00004 [imagenet-folder with train and val folders]

For the regular GoogLeNet:

python main.py -a googlenet --lr 0.01 --wd 0.0002 [imagenet-folder with train and val folders]

I have updated the training script to add some more data augmentation (ColorJitter and lighting noise), which is reduced in strength for the batch normalized version.
Also, I'm now using the poly learning rate decay, whereas the batch normalized version uses the default one. The script is here: https://github.com/TheCodez/examples/tree/inception/imagenet

Hopefully it succeeds this time.

@fmassa (Member) commented Jan 8, 2019

I launched another training run with googlenet using your newer commits. Let's see what it gives.

@fmassa (Member) commented Jan 8, 2019

@TheCodez It doesn't look like it's learning anything on 8 GPUs with the defaults you sent; it has already run through 5 epochs. I'll let it run overnight, but this aligns with my expectation that reproducing their results might not be easy.

@TheCodez (Contributor, Author) commented Jan 8, 2019

@fmassa The only obvious differences I notice compared to the Berkeley Vision version are:

  • Using a padding of 1 in the max pool layers to achieve the right output sizes. Caffe uses ceil for rounding, so no padding is needed in that case.
  • Using the initialization scheme from TensorFlow instead of "xavier".
  • Using more advanced data augmentations, which are also used in the paper.

They needed 60 epochs (2,400,000 iterations) to achieve a top-1 accuracy of 68.7% (31.3% error) and a top-5 accuracy of 88.9% (11.1% error) using the fast solver. In theory we should get better results because of the data augmentations.
My branch also includes the poly learning rate decay they use in the fast solver; maybe there is an error there. I got the code for that from here. I think this line for calculating the current iteration is not quite correct: iter = len(train_loader) * epoch + i; maybe it should just be iter = i * (epoch + 1) instead? (See the poly-decay sketch after this comment.)

Other than that, my only ideas are to start with a different learning rate, or to try the batch normalized version first and see if that works better.
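
For reference, a minimal sketch of the Caffe-style poly decay policy from the BVLC quick solver (power 0.5, max_iter 2,400,000); the function name and signature below are illustrative, not taken from the branch:

def poly_lr(optimizer, base_lr, iteration, max_iter=2400000, power=0.5):
    # Caffe 'poly' policy: lr = base_lr * (1 - iter / max_iter) ** power
    lr = base_lr * (1 - iteration / max_iter) ** power
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr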

@fmassa (Member) commented Jan 9, 2019

Hi @TheCodez ,

Here are my thoughts:

  • we should try using ceil_mode=True instead of adding a padding
  • we should use the exact same initialization they report in the paper; this makes a big difference
  • the data augmentations are not what is making the training not work at all

I believe that the fact it's not training at all is probably down to a combination of weight initialization and data transformation (are their images in 0-1 or 0-255?). Can you double-check that against existing implementations that are known to work?

@codecov-io commented Jan 9, 2019

Codecov Report

Merging #678 into master will decrease coverage by 1.08%.
The diff coverage is 16.53%.


@@            Coverage Diff            @@
##           master    #678      +/-   ##
=========================================
- Coverage   40.99%   39.9%   -1.09%     
=========================================
  Files          29      30       +1     
  Lines        2747    2874     +127     
  Branches      432     445      +13     
=========================================
+ Hits         1126    1147      +21     
- Misses       1542    1648     +106     
  Partials       79      79
Impacted Files                     Coverage Δ
torchvision/models/__init__.py     100% <100%> (ø) ⬆️
torchvision/models/googlenet.py    15.87% <15.87%> (ø)


@TheCodez (Contributor, Author) commented Jan 9, 2019

@fmassa I changed the weight initialization to xavier and added ceil mode (see the sketch at the end of this comment). I used the initialization from the BVLC GoogLeNet, as it is the only reproducible version I found. I couldn't really tell which image ranges were used. Do you know what the default for Caffe is?
[EDIT] This is what I found in the CaffeNet example code, so it should probably be the same for GoogLeNet:

Our default CaffeNet is configured to take images in BGR format. Values are expected to start in the range [0, 255] and then have the mean ImageNet pixel value subtracted from them.

But shouldn't this only matter when the model is already pretrained?

Also, I updated my training script to get rid of the line I wasn't sure about: TheCodez/examples@dde173d
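
For reference, a minimal sketch of what the switch to Xavier initialization could look like (the helper name is illustrative, not the exact code from the branch):

import torch.nn as nn

def init_weights(model):
    # Xavier ("glorot") initialization, as used by the BVLC GoogLeNet
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.xavier_uniform_(m.weight)
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)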

@TheCodez force-pushed the inception branch 2 times, most recently from 8bc54e2 to 0bfb10e on January 9, 2019
@TheCodez (Contributor, Author):

@fmassa could you please test again, if you have time?

@fmassa (Member) commented Feb 13, 2019

Hi @TheCodez ,

Sorry for the delay.

I haven't had the chance to try it again since last time. For GoogLeNet, given that it seems to be hard to train from scratch, I think I'll end up providing only pre-trained versions (by reusing the original weights), without enforcing that they have training code associated with them.

In this case, it would make sense to put it in torchhub, but I still need to figure out the details of how to do it.

@TheCodez (Contributor, Author) commented Feb 13, 2019

@fmassa my ideas would be:

  • Train again with the updated code. According to GoogLeNet not training (soumith/imagenet-multiGPU.torch#2 (comment)), a Torch version of the model was able to train correctly using the Xavier initialization. The BVLC GoogLeNet was also trained successfully using Xavier initialization, so it might be worth a shot.

  • Try the batch norm version, which should be a lot easier to train. Maybe convert those weights for the non-batch-norm version, or only provide the batch norm version.

  • Convert the weights from the TensorFlow model, which differs slightly from the paper definition. I think it also uses batch norm.

What do you think?

@fmassa (Member) commented Feb 26, 2019

@TheCodez I'm currently leaning towards adapting the weights from TensorFlow, and providing it on the Hub.

@TheCodez (Contributor, Author):

@fmassa I'll be trying to convert the weights from tensorflow.

@TheCodez (Contributor, Author) commented Feb 28, 2019

@fmassa I converted the weights. I haven't done an extensive evaluation on the ImageNet validation set; instead I just tested some images, and all of them were classified correctly.
I used a modified version of https://github.com/Cadene/tensorflow-model-zoo.torch to convert the weights, and also had to make some modifications to the network itself to match the TensorFlow model.

The problem I'm currently having is that in TensorFlow the ImageNet dataset has 1001 classes, which means I have to subtract 1 from the prediction to get the correct classification.

@colesbury how did you convert the 1001 to 1000 classes in the InceptionV3 model?

@fmassa (Member) commented Mar 6, 2019

@TheCodez I think we can remove the first row of the classifier weights, which should drop the first class. I think that should be fine?
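
A minimal sketch of that idea (the filename and the 'fc.weight'/'fc.bias' key names are assumptions for illustration, not confirmed in this thread):

import torch

state_dict = torch.load('googlenet_tf_converted.pth')  # hypothetical filename

# TF ImageNet checkpoints have 1001 classes (index 0 is a "background"
# class); dropping the first row/element leaves the standard 1000 classes.
state_dict['fc.weight'] = state_dict['fc.weight'][1:]  # [1001, 1024] -> [1000, 1024]
state_dict['fc.bias'] = state_dict['fc.bias'][1:]      # [1001] -> [1000]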

@TheCodez (Contributor, Author) commented Mar 6, 2019

@fmassa I've updated the code to match the structure required for the TensorFlow weights. I also added the input normalization used for the Inception v3 model (a sketch follows below).

Removing the first row did the trick. Weights are currently hosted here: https://github.com/TheCodez/vision/releases/tag/1.0
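
For context, the Inception-style input normalization mentioned above undoes the standard ImageNet statistics and maps inputs to the mean-0.5/std-0.5 range the TF checkpoint expects; a rough sketch of the idea (not the exact code from the PR):

import torch

def transform_input(x):
    # x is assumed to be normalized with the usual ImageNet statistics
    # (mean [0.485, 0.456, 0.406], std [0.229, 0.224, 0.225]); the TF
    # Inception checkpoints expect mean 0.5, std 0.5 per channel instead.
    mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
    return (x * std + mean - 0.5) / 0.5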

@fmassa (Member) commented Mar 7, 2019

@TheCodez thanks for the update! I'm evaluating the pre-trained model on ImageNet to compare the results, and I'll let you know.

@fmassa (Member) commented Mar 7, 2019

I got

Acc@1 68.332 Acc@5 88.552

which seems about right, thanks! I'm uploading the pre-trained weights, and then I'll update the download path and the documentation.

self.inception4c = Inception(512, 128, 128, 256, 24, 64, 64)
self.inception4d = Inception(512, 112, 144, 288, 32, 64, 64)
self.inception4e = Inception(528, 256, 160, 320, 32, 128, 128)
self.maxpool4 = nn.MaxPool2d(3, stride=2, ceil_mode=True)
@TheCodez (Contributor, Author) commented Mar 7, 2019:

@fmassa one thing to note here is that TensorFlow uses 2x2 pooling here instead of 3x3. I don't know if that has a positive impact on the accuracy, but it would mean diverging further from the paper definition.

@fmassa (Member):

Was the reason you used 3x3 pooling to make everything work out, given the differences between TF and PyTorch?

@TheCodez (Contributor, Author):

This didn't cause problems during my conversion process, so I probably just missed it. Should I change it? In that case, it might be a good idea to add a note that the implementation differs from the paper.

@fmassa (Member):

let me try seeing if it makes a difference for the performance, and I'll let you know

@fmassa (Member):

Accuracy is 1 point better using the 2x2 pooling, with

Acc@1 69.778 Acc@5 89.530

so I'll be changing it. Thanks for the heads-up!
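
For reference, the change under discussion; note that on the 14x14 input both variants still produce a 7x7 output with ceil_mode=True, so only the pooling window size differs:

import torch.nn as nn

# Paper definition: 3x3 pooling, stride 2 -> ceil((14 - 3) / 2) + 1 = 7
maxpool4_paper = nn.MaxPool2d(3, stride=2, ceil_mode=True)

# TensorFlow checkpoint: 2x2 pooling, stride 2 -> (14 - 2) / 2 + 1 = 7
# (about 1 point better top-1 accuracy with the converted weights, per above)
maxpool4_tf = nn.MaxPool2d(2, stride=2, ceil_mode=True)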

@fmassa merged commit a209300 into pytorch:master on Mar 7, 2019
@fmassa (Member) commented Mar 7, 2019

Thanks a lot @TheCodez !

@TheCodez (Contributor, Author) commented Mar 7, 2019

@fmassa Thank you for all the help and guidance 👍

@fmassa (Member) commented Mar 7, 2019

Thanks for the PR! Keep up the amazing work!
