problem about training own dataset #37

Open
meroluo opened this issue Feb 20, 2019 · 4 comments

@meroluo

meroluo commented Feb 20, 2019

Hello, I ran into some problems while training the model with my own dataset. Below is a description of my issue; I hope you can give me some suggestions. Thank you very much!

loading model and criterion...
Loading pre-trained model from: ../demo/model/EDSR_x4.t7
Creating data loader...
loading data...
Initializing data loader for train set...
Initializing data loader for val set...
Train start
/home/luomeilu/torch/install/bin/luajit: ...luomeilu/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 1 callback] /home/luomeilu/torch/install/share/lua/5.1/image/init.lua:367: /var/tmp/dataset/DIV2K/DIV2K_train_LR_bicubic/X4/0045x4.png: No such file or directory
stack traceback:
[C]: in function 'error'
/home/luomeilu/torch/install/share/lua/5.1/image/init.lua:367: in function 'load'
./data/div2k.lua:122: in function 'get'
./dataloader.lua:89: in function <./dataloader.lua:76>
[C]: in function 'xpcall'
...luomeilu/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
...e/luomeilu/torch/install/share/lua/5.1/threads/queue.lua:65: in function <...e/luomeilu/torch/install/share/lua/5.1/threads/queue.lua:41>
[C]: in function 'pcall'
...e/luomeilu/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
[string " local Queue = require 'threads.queue'..."]:15: in main chunk
stack traceback:
[C]: in function 'error'
...luomeilu/torch/install/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
./dataloader.lua:158: in function '(for generator)'
./train.lua:69: in function 'train'
main.lua:33: in main chunk
[C]: in function 'dofile'
...eilu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00406670

@limbee
Owner

limbee commented Feb 21, 2019

/var/tmp/dataset/DIV2K/DIV2K_train_LR_bicubic/X4/0045x4.png: No such file or directory

This line indicates that the training images are not located correctly, or that you didn't specify the location of your dataset.
You should first define your own dataset parser, e.g., /data/code/yourdataset.lua, and update some other code such as /code/opts.lua.
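A minimal sketch of what such a parser might look like, modeled loosely on the existing div2k.lua loader. The class name, directory layout, image counts, and method signatures below are assumptions for illustration, not the repository's actual interface; adapt them to your own data and to what dataloader.lua expects.

-- code/data/yourdataset.lua (hypothetical skeleton)
require 'torch'
local image = require 'image'
local paths = require 'paths'

local M = {}
local yourdataset = torch.class('sr.yourdataset', M)

function yourdataset:__init(opt, split)
    self.opt = opt
    self.split = split                      -- 'train' or 'val'
    self.scale = opt.scale
    -- Assumed layout: <datadir>/yourdataset/HR and <datadir>/yourdataset/LR_bicubic/X<scale>
    self.dirHR = paths.concat(opt.datadir, 'yourdataset', 'HR')
    self.dirLR = paths.concat(opt.datadir, 'yourdataset', 'LR_bicubic', 'X' .. self.scale)
    self.nImages = (split == 'train') and 800 or 100   -- replace with your image counts
end

function yourdataset:get(idx)
    -- Assumes DIV2K-style names: 0001.png for HR, 0001x4.png for LR
    local name = string.format('%04d', idx)
    local target = image.load(paths.concat(self.dirHR, name .. '.png'), 3, 'float')
    local input  = image.load(paths.concat(self.dirLR, name .. 'x' .. self.scale .. '.png'), 3, 'float')
    return {input = input, target = target}
end

function yourdataset:__size()
    return self.nImages
end

return M.yourdataset

You would also need to register the new dataset name wherever opts.lua and the data loader select a dataset; the exact hook point depends on the version of the code you are using.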

@meroluo
Author

meroluo commented Feb 22, 2019

Thank you very much! Now I have put the dataset in the correct location. When I train the model with the patch size set to 256, the following error occurs:

loading model and criterion...
Loading pre-trained model from: ../demo/model/EDSR_x2.t7
Load pre-trained SRResnet and change upsampler
Changing upsample layers
Creating data loader...
loading data...
Initializing data loader for train set...
Initializing data loader for val set...
Train start
THCudaCheck FAIL file=/home/luomeilu/torch/extra/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/home/luomeilu/torch/install/bin/luajit: /home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:67:
In 3 module of nn.Sequential:
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
In 29 module of nn.Sequential:
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
In 3 module of nn.Sequential:
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:216: cuda runtime error (2) : out of memory at /home/luomeilu/torch/extra/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
[C]: in function 'resizeAs'
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:216: in function 'updateGradInput'
/home/luomeilu/torch/install/share/lua/5.1/nn/Module.lua:31: in function </home/luomeilu/torch/install/share/lua/5.1/nn/Module.lua:29>
[C]: in function 'xpcall'
/home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function <...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:78>
[C]: in function 'xpcall'
/home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
.../luomeilu/torch/install/share/lua/5.1/nn/ConcatTable.lua:66: in function <.../luomeilu/torch/install/share/lua/5.1/nn/ConcatTable.lua:30>
[C]: in function 'xpcall'
...
/home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function <...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:78>
[C]: in function 'xpcall'
/home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
./train.lua:89: in function 'train'
main.lua:33: in main chunk
[C]: in function 'dofile'
...eilu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00406670

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
./train.lua:89: in function 'train'
main.lua:33: in main chunk
[C]: in function 'dofile'
...eilu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00406670

So I wonder if I can set the patch size smaller. If so, will it affect the training results?

@limbee
Owner

limbee commented Feb 24, 2019

The number of channels is set to 256, and patch size is 96.

# Bicubic scale 2
#th main.lua -scale 2 -nFeat 256 -nResBlock 36 -patchSize 96 -scaleRes 0.1 -skipBatch 3

This setting is suited for GPUs with 12GB of memory, so GPUs with less memory will probably give you an OOM error. You can change the batch size or patch size using the options:

cmd:option('-batchSize', 16, 'Mini-batch size (1 = pure stochastic)')

cmd:option('-patchSize', 96, 'Training patch size')

Reducing the patch size may affect the final performance.
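For example, a run that trims both values might look like the line below. The flag names are the ones quoted above; the specific values (patch size 48, batch size 8) are only a guess at what fits in less memory and would need tuning for your GPU.

th main.lua -scale 4 -nFeat 256 -nResBlock 36 -patchSize 48 -batchSize 8 -scaleRes 0.1 -skipBatch 3

Smaller patches reduce memory per sample, but each example then carries less spatial context, which is one reason the final performance can drop.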

@meroluo
Author

meroluo commented Feb 27, 2019

Thank you for your suggestion! In my training the scale is set to 4; I will try changing the batch size or patch size in opts.lua.
