problems with training #12

Open
mafda opened this issue Jan 7, 2020 · 3 comments


@mafda

mafda commented Jan 7, 2020

Hi,

I successfully created the dataset with the following settings:

MODEL_FILE=dragon.yml
TRAIN_SAMPLES=140000
VALID_SAMPLES=60000
MAX_TRANSLATION=0.02 
MAX_ROTATION=15      
BOUNDING_BOX=0       
RESOLUTION=174       
DATA_TYPE=numpy

and to train, I use the following settings:

OCCLUDER_PATH=hand
BACKGROUND_PATH=SUN3D
BATCH_SIZE=64
BACKEND=cuda 
EPOCH=100

However, my training does not converge and I have results like these:

[Image: 200k_2018_1]

[Image: 200k_2018_2]

I would like your help with a few questions:

  • Do you know how I can get better results with this data configuration?

  • Do you perform finetuning during training?

  • In addition to hand occlusion, do you use other occlusion models?

  • Regarding the background dataset, how do you use the SUN3D dataset? How many different places from the dataset do you use?

  • Finally, during testing, how is the tracker initialized or reset?

Do you have any other suggestions for overcoming this problem?

Thanks

@MathGaron
Contributor

Hey,

First, it seems like there is something quite wrong with those training curves, and I would first try to understand why. I'm not sure how these kinds of jumps are possible. Here are a few tips to help find the problem; do not hesitate to update me with anything you find.

  • For SUN3D, use as many scenes as possible. 8 seems low, so make sure you don't bias the training with too few backgrounds.
  • You should see the network converge after ~20 epochs if you generate ~200k samples; there is no need to train it much further.
  • You can always train a simpler version as a test, e.g. without occluders and without backgrounds. (You might have to change the training code slightly, but this should help.)
  • Try to visualize the network inputs (the two RGBD images given as input). Maybe there is a rendering bug, or some other bug. For RGB, you can look at the unnormalized image; for D, you should check the numerical values (in mm) to make sure that everything is normal.
  • Normally you should not see such a gap between train and valid because they are the same type of data.
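
The input check suggested above could be sketched roughly as follows. This is only a minimal illustration, assuming the samples can be loaded as NumPy arrays: the function name `summarize_rgbd`, the array layout (H x W x 3 for RGB, H x W for depth), the convention that 0 marks missing depth, and the plausible range in mm are all assumptions, not part of the project's code.

```python
import numpy as np


def summarize_rgbd(rgb, depth, min_mm=100.0, max_mm=5000.0):
    """Return basic sanity statistics for one RGB-D input.

    Assumptions (hypothetical, adapt to the real data): rgb is HxWx3,
    depth is HxW in millimetres, 0 marks missing depth pixels, and
    [min_mm, max_mm] is a placeholder for the expected depth range.
    """
    rgb = np.asarray(rgb)
    depth = np.asarray(depth, dtype=np.float64)
    valid = depth > 0  # NaN and 0 are both treated as invalid here
    stats = {
        "rgb_shape": rgb.shape,
        "rgb_min": float(rgb.min()),
        "rgb_max": float(rgb.max()),
        "depth_has_nan": bool(np.isnan(depth).any()),
        "depth_valid_ratio": float(valid.mean()),
    }
    if valid.any():
        d = depth[valid]
        stats["depth_min_mm"] = float(d.min())
        stats["depth_max_mm"] = float(d.max())
        stats["depth_in_range"] = bool(d.min() >= min_mm and d.max() <= max_mm)
    else:
        # No usable depth at all is itself a sign that something is wrong.
        stats["depth_in_range"] = False
    return stats
```

Running this on a handful of training and validation samples, for both input images, should quickly show whether depth is stored in the wrong unit (for example metres instead of mm) or whether a render is empty.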

For your questions:

  • No need to finetune; the version using SqueezeNet can generalize from synthetic to real.
  • I just use the hand model for occlusion (the one provided on the website).
  • For testing, I suggest reading the ECCV paper. The tracker is initialized depending on the scenario. If you have questions about the paper, just ask!

@mafda
Author

mafda commented Jan 8, 2020

Hi, thank you for the clarification.

I will train the network again with the suggestions you gave me.
I hope this helps me improve the training.

@mafda
Author

mafda commented Feb 5, 2020

Hi,

I trained the network again with the suggestions you gave me:

  • Verified the data generated by generate_dataset.sh. It looks OK to me.
    [Image: 01]

  • Trained a network for 20 epochs without backgrounds or occlusion.
    [Image: 02]

  • Increased the number of background scenes from 8 to 62.

  • Trained with the additional background data and fewer epochs (20).
    [Image: 04]

However, as the plots show, the error stays between 2 and 4, and there are still some strange jumps during training. So I'm not sure that these results are correct. I used the pytorch_toolbox version tagged as v1.0.0.
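
One way to pin down where such jumps occur, so that the batches or samples around them can be inspected, is to flag epochs whose loss rises sharply relative to the previous one. The following is only a minimal sketch: the helper name `find_loss_jumps`, the ratio threshold, and the loss values in the usage example are all made up for illustration and are not taken from the plots or from pytorch_toolbox.

```python
def find_loss_jumps(losses, ratio=1.5):
    """Return indices of epochs whose loss exceeds `ratio` times the previous loss.

    `losses` is a plain list of per-epoch loss values; `ratio` is an
    arbitrary threshold to tune (hypothetical default).
    """
    jumps = []
    for i in range(1, len(losses)):
        prev, cur = losses[i - 1], losses[i]
        if prev > 0 and cur > ratio * prev:
            jumps.append(i)
    return jumps


# Illustrative values only, not real training results.
example = [4.0, 3.0, 2.5, 6.0, 2.4]
print(find_loss_jumps(example))  # prints [3]
```

Applying the same check to the training and validation curves separately would show whether the jumps appear in both or only in one of them.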

I also have some further questions about certain configuration values:

  • In the generate_dataset.sh script, which bounding box value was used, 0 or 10? (I used BOUNDING_BOX=0.)
  • For training, what weightdecay and learningrate values were set? I did not find a detailed description of the network training configuration in the paper. Could you describe these values in more detail?
  • Would it be possible for you to share one of the trained models (model_train.pt) with me, so that I can validate the model test script I wrote?

Thanks
