Segmentation fault when training baseline model #27

Open

fragileness opened this issue Dec 26, 2017 · 3 comments

@fragileness

I got the message below when training the baseline model:

...
[Iter: 299.1k / lr: 5.00e-5] Time: 66.29 (Data: 61.42) Err: 3.234126
[Iter: 299.2k / lr: 5.00e-5] Time: 65.32 (Data: 60.11) Err: 3.496183
[Iter: 299.3k / lr: 5.00e-5] Time: 66.40 (Data: 61.23) Err: 3.399313
[Iter: 299.4k / lr: 5.00e-5] Time: 64.99 (Data: 60.01) Err: 3.379927
[Iter: 299.5k / lr: 5.00e-5] Time: 65.95 (Data: 60.72) Err: 3.503887
[Iter: 299.6k / lr: 5.00e-5] Time: 66.23 (Data: 61.05) Err: 3.338660
[Iter: 299.7k / lr: 5.00e-5] Time: 65.30 (Data: 59.97) Err: 3.448611
[Iter: 299.8k / lr: 5.00e-5] Time: 65.69 (Data: 60.95) Err: 3.330575
[Iter: 299.9k / lr: 5.00e-5] Time: 66.04 (Data: 61.20) Err: 3.350167
[Iter: 300.0k / lr: 5.00e-5] Time: 65.34 (Data: 59.59) Err: 3.413485
[Epoch 300 (iter/epoch: 1000)] Test time: 25.48
(scale 2) Average PSNR: 35.5833 (Highest ever: 35.5902 at epoch = 288)

Segmentation fault (core dumped)

I'm not sure whether the training process completed successfully or not. If it did, where is the trained model?

@limbee (Owner) commented Dec 27, 2017

I'm not sure why it prints the segmentation fault message, but the experiment finished successfully. Trained models are saved under experiment/
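
For reference, loading a saved checkpoint back into Torch would look roughly like the sketch below. The file name is an assumption; the actual checkpoint name written under experiment/ depends on the run.

require 'nn'
require 'cunn'   -- checkpoints trained on GPU contain CUDA tensor types
require 'cudnn'  -- needed if the checkpoint contains cudnn layers

-- Hypothetical path: substitute the actual .t7 file written under experiment/
local model = torch.load('experiment/model/baseline_x2.t7')
model:evaluate()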

@fragileness (Author)

I'm trying the next step of training (item 1 in training.sh):
th main.lua -scale 2 -nFeat 256 -nResBlock 36 -patchSize 96 -scaleRes 0.1 -skipBatch 3
but I'm seeing the out-of-memory error below.
I've tried other chopSize values, such as:
th main.lua -scale 2 -nFeat 256 -nResBlock 36 -patchSize 96 -scaleRes 0.1 -skipBatch 3 -chopSize 16e0
but the situation remains the same.
How small can chopSize be set? Or are there any other options I can try?

loading model and criterion...
Creating model from file: models/baseline.lua
Creating data loader...
loading data...
Initializing data loader for train set...
Initializing data loader for val set...
Train start
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-9315/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/home/onegin/torch/install/bin/luajit: /home/onegin/torch/install/share/lua/5.1/nn/Container.lua:67:
In 3 module of nn.Sequential:
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
In 22 module of nn.Sequential:
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
In 1 module of nn.Sequential:
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:216: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-9315/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
[C]: in function 'resizeAs'
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:216: in function 'updateGradInput'
/home/onegin/torch/install/share/lua/5.1/nn/Module.lua:31: in function </home/onegin/torch/install/share/lua/5.1/nn/Module.lua:29>
[C]: in function 'xpcall'
/home/onegin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function </home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:78>
[C]: in function 'xpcall'
/home/onegin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/onegin/torch/install/share/lua/5.1/nn/ConcatTable.lua:66: in function </home/onegin/torch/install/share/lua/5.1/nn/ConcatTable.lua:30>
[C]: in function 'xpcall'
...
/home/onegin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function </home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:78>
[C]: in function 'xpcall'
/home/onegin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
./train.lua:89: in function 'train'
main.lua:33: in main chunk
[C]: in function 'dofile'
...egin/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/home/onegin/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
/home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
./train.lua:89: in function 'train'
main.lua:33: in main chunk
[C]: in function 'dofile'
...egin/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

@limbee (Owner) commented Dec 27, 2017

Try nResBlock=32 instead of 36 if you're using a Titan X. We used 32 residual blocks when writing the paper, since 12GB of GPU memory is sometimes not enough for 36 residual blocks.
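
For example, the same command as above with the reduced block count substituted in:

th main.lua -scale 2 -nFeat 256 -nResBlock 32 -patchSize 96 -scaleRes 0.1 -skipBatch 3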
