runtraining.sh
This script runs CDeep3M training to generate what is known as a trained model. In the case of CDeep3M, three separate models (1fm, 3fm, and 5fm) are actually trained, as described below.
This script is actually a wrapper that invokes CreateTrainJob.m and run_all_train.sh.
NOTE: If multiple GPUs are available, this script will run the training in parallel.
Example:
runtraining.sh --numiterations 1000 ~/augtrain ~/model
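To pin training to a single GPU instead of the default of all GPUs, add the --gpu flag (GPU index 0 below is just an illustration):
runtraining.sh --gpu 0 --numiterations 1000 ~/augtrain ~/model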
Usage:
usage: runtraining.sh [-h] [--1fmonly] [--numiterations NUMITERATIONS]
[--gpu GPU] [--base_lr BASE_LR] [--power POWER]
[--momentum MOMENTUM]
[--weight_decay WEIGHT_DECAY]
[--average_loss AVERAGE_LOSS]
[--lr_policy POLICY] [--iter_size ITER_SIZE]
[--snapshot_interval SNAPSHOT_INTERVAL]
[--validation_dir VALIDATION_DIR]
[--additerations NUMITERATIONS]
[--retrain TRAINOUTDIR]
augtrainimages trainoutdir
Version: 1.6.0
Trains Deep3M model using caffe with training data
passed into script.
For further information about parameters below please see:
https://github.com/BVLC/caffe/wiki/Solver-Prototxt
positional arguments:
augtrainimages Augmented training data from PreprocessTrainingData.m
trainoutdir Desired output directory
optional arguments:
-h, --help show this help message and exit
--1fmonly Only train 1fm model
--gpu Which GPU to use; can be a number, e.g. 0 or 1, or
all to use all GPUs (default all)
--base_learn Base learning rate (default 1e-02)
--power Used in poly and sigmoid lr_policies. (default 0.8)
--momentum Indicates how much of the previous weight update will be
retained in the new calculation. (default 0.9)
--weight_decay Factor of (regularization) penalization of large
weights (default 0.0005)
--average_loss Number of iterations to use to average loss
(default 16)
--lr_policy Learning rate policy (default poly)
--iter_size Accumulate gradients across batches through the
iter_size solver field. (default 8)
--snapshot_interval How often caffe should output a model and solverstate.
(default 2000)
--numiterations Number of training iterations to run (default 30000)
--validation_dir Augmented validation data
--retrain Continue training trained models from train directory
passed in here, writing results to trainoutdir
--additerations If --retrain is set, this value is added to the
latest iteration model file found in the
<retrain dir>/1fm/trainedmodel directory. For example,
if the latest iteration found in
<retrain>/1fm/trainedmodel is 10000 and
--additerations is set to 500 then training will
run to 10500 iterations. (default 2000)
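For example, to continue training a previous run for an additional 500 iterations (directory names below are hypothetical):
runtraining.sh --retrain ~/model --additerations 500 ~/augtrain ~/model_v2
With the default poly learning rate policy, Caffe decays the learning rate each iteration as
lr = base_lr * (1 - iter / max_iter) ^ power
so with the defaults above the rate starts at 1e-02 and falls smoothly to zero by the final iteration.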
This script will create a new directory, denoted as trainoutdir in usage above, which will be structured as follows:
Tree view of directory showing only base files and directories
├── 1fm
│ ├── log
│ ├── trainedmodel
├── 3fm
│ ├── log
│ ├── trainedmodel
├── 5fm
│ ├── log
│ ├── trainedmodel
├── parallel.jobs
├── readme.txt
├── valid_file.txt
└── train_file.txt
These directories contain the trained models; each one has an identical structure, shown here populated with actual files:
├── #fm
│ ├── deploy.prototxt
│ ├── label_class_selection.prototxt
│ ├── log
│ │ ├── caffe.bin.INFO
│ │ ├── caffe.bin.ip-XXX.ubuntu.log.INFO.XXXX
│ │ └── out.log
│ ├── solver.prototxt
│ ├── trainedmodel
│ │ ├── 1fm_classifer_iter_###.caffemodel
│ │ └── 1fm_classifer_iter_###.solverstate
│ ├── train_file.txt
│ ├── train_val.prototxt
│ └── valid_file.txt
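While training is running, progress for a given model can be monitored by tailing its out.log (a minimal sketch, assuming trainoutdir is ~/model):
tail -f ~/model/1fm/log/out.log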
The actual trained model resides under #fm/trainedmodel in the .caffemodel file.
The other file, the .solverstate, is needed to resume training but is not needed for prediction.
The ### in the .caffemodel and .solverstate file names denotes the iteration at which the model was saved.
As Caffe trains, multiple .caffemodel files are written out, so #fm/trainedmodel may contain several model files saved at different iterations of completion.
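For example, to find the most recent 1fm snapshot (a minimal sketch, assuming trainoutdir is ~/model; ls -v sorts numerically, so the highest iteration is listed last):
ls -v ~/model/1fm/trainedmodel/*.caffemodel | tail -n 1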