Segmentation fault when training baseline model #27

Open

fragileness opened this issue Dec 26, 2017 · 3 comments

@fragileness

I got the message below when training the baseline model:

...
[Iter: 299.1k / lr: 5.00e-5] Time: 66.29 (Data: 61.42) Err: 3.234126
[Iter: 299.2k / lr: 5.00e-5] Time: 65.32 (Data: 60.11) Err: 3.496183
[Iter: 299.3k / lr: 5.00e-5] Time: 66.40 (Data: 61.23) Err: 3.399313
[Iter: 299.4k / lr: 5.00e-5] Time: 64.99 (Data: 60.01) Err: 3.379927
[Iter: 299.5k / lr: 5.00e-5] Time: 65.95 (Data: 60.72) Err: 3.503887
[Iter: 299.6k / lr: 5.00e-5] Time: 66.23 (Data: 61.05) Err: 3.338660
[Iter: 299.7k / lr: 5.00e-5] Time: 65.30 (Data: 59.97) Err: 3.448611
[Iter: 299.8k / lr: 5.00e-5] Time: 65.69 (Data: 60.95) Err: 3.330575
[Iter: 299.9k / lr: 5.00e-5] Time: 66.04 (Data: 61.20) Err: 3.350167
[Iter: 300.0k / lr: 5.00e-5] Time: 65.34 (Data: 59.59) Err: 3.413485
[Epoch 300 (iter/epoch: 1000)] Test time: 25.48
(scale 2) Average PSNR: 35.5833 (Highest ever: 35.5902 at epoch = 288)

Segmentation fault (core dumped)

I'm not sure whether the training process completed successfully or not. If it did, where is the trained model?

@limbee (Owner) commented Dec 27, 2017

I'm not sure why it prints the segmentation fault message, but the experiment finished successfully. Trained models are saved under experiment/
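
For reference, loading a saved checkpoint back into Torch would look roughly like the sketch below. The file name is an assumption; the actual checkpoint name written under experiment/ depends on the run.

require 'nn'
require 'cunn'   -- checkpoints trained on GPU contain CUDA tensor types
require 'cudnn'  -- needed if the checkpoint contains cudnn layers

-- Hypothetical path: substitute the actual .t7 file written under experiment/
local model = torch.load('experiment/model/baseline_x2.t7')
model:evaluate()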

@fragileness (Author)

I'm trying the next step of training (item 1 in training.sh):
th main.lua -scale 2 -nFeat 256 -nResBlock 36 -patchSize 96 -scaleRes 0.1 -skipBatch 3
but I'm seeing the out-of-memory error below.
I've tried other chopSize values, such as:
th main.lua -scale 2 -nFeat 256 -nResBlock 36 -patchSize 96 -scaleRes 0.1 -skipBatch 3 -chopSize 16e0
but the situation remains the same.
How small can chopSize be set? Or are there any other options I can try?

loading model and criterion...
Creating model from file: models/baseline.lua
Creating data loader...
loading data...
Initializing data loader for train set...
Initializing data loader for val set...
Train start
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-9315/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/home/onegin/torch/install/bin/luajit: /home/onegin/torch/install/share/lua/5.1/nn/Container.lua:67:
In 3 module of nn.Sequential:
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
In 22 module of nn.Sequential:
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
In 1 module of nn.Sequential:
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:216: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-9315/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
[C]: in function 'resizeAs'
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:216: in function 'updateGradInput'
/home/onegin/torch/install/share/lua/5.1/nn/Module.lua:31: in function </home/onegin/torch/install/share/lua/5.1/nn/Module.lua:29>
[C]: in function 'xpcall'
/home/onegin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function </home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:78>
[C]: in function 'xpcall'
/home/onegin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/onegin/torch/install/share/lua/5.1/nn/ConcatTable.lua:66: in function </home/onegin/torch/install/share/lua/5.1/nn/ConcatTable.lua:30>
[C]: in function 'xpcall'
...
/home/onegin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function </home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:78>
[C]: in function 'xpcall'
/home/onegin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
./train.lua:89: in function 'train'
main.lua:33: in main chunk
[C]: in function 'dofile'
...egin/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/home/onegin/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
/home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
./train.lua:89: in function 'train'
main.lua:33: in main chunk
[C]: in function 'dofile'
...egin/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

@limbee (Owner) commented Dec 27, 2017

Try nResBlock=32 instead of 36 if you're using a Titan X. We used 32 residual blocks when writing the paper, since 12GB of GPU memory is sometimes not enough for 36 residual blocks.
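
For example, the same command as above with the reduced block count substituted in:

th main.lua -scale 2 -nFeat 256 -nResBlock 32 -patchSize 96 -scaleRes 0.1 -skipBatch 3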
