problems with training #12

mafda · 2020-01-07T19:37:36Z

Hi,

I successfully created the dataset with the following settings:

MODEL_FILE=dragon.yml
TRAIN_SAMPLES=140000
VALID_SAMPLES=60000
MAX_TRANSLATION=0.02 
MAX_ROTATION=15      
BOUNDING_BOX=0       
RESOLUTION=174       
DATA_TYPE=numpy

and to train, I use the following settings:

OCCLUDER_PATH=hand
BACKGROUND_PATH=SUN3D
BATCH_SIZE=64
BACKEND=cuda 
EPOCH=100

Occluders: the hand occluder dataset
Backgrounds: SUN3D (all frames, from only 8 different places), laid out as:
- subfolder_name ->
  - 0.png (rgb image)
  - 0d.png (depth image)
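
A quick way to confirm how many scenes and complete RGB/depth pairs a background folder actually holds is sketched below. It assumes the layout above (`<root>/<scene>/<n>.png` with a matching `<n>d.png`); the function name `scan_backgrounds` is hypothetical and not part of the repository.

```python
from pathlib import Path


def scan_backgrounds(root):
    """Count complete RGB/depth pairs per scene subfolder.

    Assumes <root>/<scene>/<n>.png (RGB) with a matching <n>d.png (depth).
    Returns {scene_name: (pair_count, [RGB files with no depth image])}.
    """
    report = {}
    scenes = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for scene in scenes:
        pairs, missing = 0, []
        for rgb in sorted(scene.glob("*.png")):
            if rgb.stem.endswith("d"):
                continue  # depth image, checked through its RGB partner
            if (scene / f"{rgb.stem}d.png").exists():
                pairs += 1
            else:
                missing.append(rgb.name)
        report[scene.name] = (pairs, missing)
    return report
```

Calling `len(scan_backgrounds("SUN3D"))` gives the number of scenes, and any non-empty `missing` list points at frames without a depth image.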

However, my training does not converge and I have results like these:

I would like your help with a few questions:

Do you know how I can get better results with this data configuration?
Do you perform finetuning during training?
In addition to hand occlusion, do you use other occlusion models?
Regarding the background dataset, how do you use the SUN3D dataset? How many different places from the dataset do you use?
Finally, during testing, how is the tracker initialized or reset?

Or do you have any other suggestions to overcome this problem?

Thanks

MathGaron · 2020-01-08T14:31:25Z

Hey,

First, something seems quite wrong with those training curves, and I would start by trying to understand why. I am not sure how jumps like these are possible. Here are a few tips to help find the problem; do not hesitate to update me with anything you find.

For SUN3D, use as many scenes as possible. 8 seems low, so make sure you don't bias the training with too few backgrounds.
You should see the network converge after ~20 epochs if you generate ~200k samples; there is no need to train it much further.
You can always train a simpler version as a test, e.g. without occluders and without backgrounds. (You might have to change the training code slightly, but this should help.)
Try to visualize the network inputs (the two RGBD images given as input). Maybe there is a render bug, or some other bug. For RGB, you can look at the unnormalized image; for depth, you should check the numerical values (in mm) to make sure that everything is normal.
Normally you should not see such a gap between train and valid, because they are the same type of data.
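
As a rough illustration of the input check suggested above, here is a minimal sketch. It assumes each network input is an array of shape (4, H, W) with normalized RGB in the first three channels and depth in the fourth, and that normalization was `(x - mean) / std`. The function name `describe_rgbd` and the `mean`/`std` arguments are hypothetical; substitute whatever the data loader really applies.

```python
import numpy as np


def describe_rgbd(sample, mean, std):
    """Undo normalization on one RGB-D input (4, H, W) and summarize it.

    mean and std are per-channel sequences of length 4 (assumed values;
    take them from your own data loader).
    """
    sample = np.asarray(sample, dtype=np.float32)
    mean = np.asarray(mean, dtype=np.float32).reshape(4, 1, 1)
    std = np.asarray(std, dtype=np.float32).reshape(4, 1, 1)
    raw = sample * std + mean  # invert (x - mean) / std

    # RGB as an H x W x 3 uint8 image, ready to display or save
    rgb = np.clip(np.rint(raw[:3]), 0, 255).astype(np.uint8).transpose(1, 2, 0)
    depth = raw[3]
    valid = depth[depth > 0]  # zero treated as "no depth" (assumption)
    stats = {
        "rgb_min": int(rgb.min()),
        "rgb_max": int(rgb.max()),
        "depth_valid_ratio": float(valid.size) / depth.size,
        "depth_min_mm": float(valid.min()) if valid.size else None,
        "depth_max_mm": float(valid.max()) if valid.size else None,
        "has_nan": bool(np.isnan(raw).any()),
    }
    return rgb, depth, stats
```

An all-black RGB image, a depth range far from the expected object distance in mm, or any NaN would each point to a rendering or loading bug rather than a training problem.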

For your questions:

No need to finetune; the version using SqueezeNet can generalize from synthetic to real.
I only use the hand model for occlusion (the one provided on the website).
For testing, I suggest reading the ECCV paper: the tracker is initialized depending on the scenario. If you have questions about the paper, just ask!

mafda · 2020-01-08T15:21:42Z

Hi, thank you for the clarification.

I will train the network again with the suggestions you gave me.
I hope this helps me improve the training.

mafda · 2020-02-05T16:25:38Z

Hi,

I trained the network again with the suggestions you gave me:

Verified the data generated by generate_dataset.sh. It looks OK to me.
Trained a network for 20 epochs without backgrounds or occlusion.
Increased the background scenes from 8 to 62.
Trained with the additional background data and fewer epochs (20 epochs).

However, as can be seen in the plots, the error stays between 2 and 4, and there are still some strange jumps during training, so I'm not sure that these results are correct. I used the pytorch_toolbox version tagged v1.0.0.

Now I have some more questions about the configured values:

In the generate_dataset.sh script, which bounding box value was used, 0 or 10? (I set BOUNDING_BOX=0.)
For training, what weightdecay and learningrate values were set? I did not find a detailed description of the network training configuration in the paper. Could you describe these values in a little more detail?
I would also like to know whether you could share one of the trained models (model_train.pt), so that I can validate the model test script that I wrote.

Thanks
