problem about training own dataset #37

Open
meroluo opened this issue Feb 20, 2019 · 4 comments

@meroluo

meroluo commented Feb 20, 2019

Hello, I ran into some problems while training the model with my own dataset. Below is a description of my issue; I hope you can give me some suggestions. Thank you very much!

loading model and criterion...
Loading pre-trained model from: ../demo/model/EDSR_x4.t7
Creating data loader...
loading data...
Initializing data loader for train set...
Initializing data loader for val set...
Train start
/home/luomeilu/torch/install/bin/luajit: ...luomeilu/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 1 callback] /home/luomeilu/torch/install/share/lua/5.1/image/init.lua:367: /var/tmp/dataset/DIV2K/DIV2K_train_LR_bicubic/X4/0045x4.png: No such file or directory
stack traceback:
[C]: in function 'error'
/home/luomeilu/torch/install/share/lua/5.1/image/init.lua:367: in function 'load'
./data/div2k.lua:122: in function 'get'
./dataloader.lua:89: in function <./dataloader.lua:76>
[C]: in function 'xpcall'
...luomeilu/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
...e/luomeilu/torch/install/share/lua/5.1/threads/queue.lua:65: in function <...e/luomeilu/torch/install/share/lua/5.1/threads/queue.lua:41>
[C]: in function 'pcall'
...e/luomeilu/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
[string " local Queue = require 'threads.queue'..."]:15: in main chunk
stack traceback:
[C]: in function 'error'
...luomeilu/torch/install/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
./dataloader.lua:158: in function '(for generator)'
./train.lua:69: in function 'train'
main.lua:33: in main chunk
[C]: in function 'dofile'
...eilu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00406670

@limbee
Owner

limbee commented Feb 21, 2019

/var/tmp/dataset/DIV2K/DIV2K_train_LR_bicubic/X4/0045x4.png: No such file or directory

This line indicates that the training images are not located correctly, or that you didn't specify the location of your dataset.
You should first define your own dataset parser, e.g., /data/code/yourdataset.lua, and update some other code such as /code/opts.lua.
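A minimal sketch of what such a parser might look like, modeled loosely on the existing div2k.lua loader. The class name, directory layout, image counts, and method signatures below are assumptions for illustration, not the repository's actual interface; adapt them to your own data and to what dataloader.lua expects.

-- code/data/yourdataset.lua (hypothetical skeleton)
require 'torch'
local image = require 'image'
local paths = require 'paths'

local M = {}
local yourdataset = torch.class('sr.yourdataset', M)

function yourdataset:__init(opt, split)
    self.opt = opt
    self.split = split                      -- 'train' or 'val'
    self.scale = opt.scale
    -- Assumed layout: <datadir>/yourdataset/HR and <datadir>/yourdataset/LR_bicubic/X<scale>
    self.dirHR = paths.concat(opt.datadir, 'yourdataset', 'HR')
    self.dirLR = paths.concat(opt.datadir, 'yourdataset', 'LR_bicubic', 'X' .. self.scale)
    self.nImages = (split == 'train') and 800 or 100   -- replace with your image counts
end

function yourdataset:get(idx)
    -- Assumes DIV2K-style names: 0001.png for HR, 0001x4.png for LR
    local name = string.format('%04d', idx)
    local target = image.load(paths.concat(self.dirHR, name .. '.png'), 3, 'float')
    local input  = image.load(paths.concat(self.dirLR, name .. 'x' .. self.scale .. '.png'), 3, 'float')
    return {input = input, target = target}
end

function yourdataset:__size()
    return self.nImages
end

return M.yourdataset

You would also need to register the new dataset name wherever opts.lua and the data loader select a dataset; the exact hook point depends on the version of the code you are using.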

@meroluo
Author

meroluo commented Feb 22, 2019

Thank you very much! Now I have put the dataset in the correct location. When I train the model with the patch size set to 256, the following error occurs:

loading model and criterion...
Loading pre-trained model from: ../demo/model/EDSR_x2.t7
Load pre-trained SRResnet and change upsampler
Changing upsample layers
Creating data loader...
loading data...
Initializing data loader for train set...
Initializing data loader for val set...
Train start
THCudaCheck FAIL file=/home/luomeilu/torch/extra/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/home/luomeilu/torch/install/bin/luajit: /home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:67:
In 3 module of nn.Sequential:
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
In 29 module of nn.Sequential:
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
In 3 module of nn.Sequential:
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:216: cuda runtime error (2) : out of memory at /home/luomeilu/torch/extra/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
[C]: in function 'resizeAs'
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:216: in function 'updateGradInput'
/home/luomeilu/torch/install/share/lua/5.1/nn/Module.lua:31: in function </home/luomeilu/torch/install/share/lua/5.1/nn/Module.lua:29>
[C]: in function 'xpcall'
/home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function <...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:78>
[C]: in function 'xpcall'
/home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
.../luomeilu/torch/install/share/lua/5.1/nn/ConcatTable.lua:66: in function <.../luomeilu/torch/install/share/lua/5.1/nn/ConcatTable.lua:30>
[C]: in function 'xpcall'
...
/home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function <...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:78>
[C]: in function 'xpcall'
/home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
./train.lua:89: in function 'train'
main.lua:33: in main chunk
[C]: in function 'dofile'
...eilu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00406670

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
./train.lua:89: in function 'train'
main.lua:33: in main chunk
[C]: in function 'dofile'
...eilu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00406670

So I wonder if I can set the patch size smaller. If so, will it affect the training results?

@limbee
Owner

limbee commented Feb 24, 2019

The number of channels is set to 256, and patch size is 96.

# Bicubic scale 2
#th main.lua -scale 2 -nFeat 256 -nResBlock 36 -patchSize 96 -scaleRes 0.1 -skipBatch 3

This setting is suited for GPUs with 12GB of memory, so GPUs with less memory will probably give you an OOM error. You can change the batch size or patch size using the options:

cmd:option('-batchSize', 16, 'Mini-batch size (1 = pure stochastic)')

cmd:option('-patchSize', 96, 'Training patch size')

Reducing the patch size may affect the final performance.
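For example, a run that trims both values might look like the line below. The flag names are the ones quoted above; the specific values (patch size 48, batch size 8) are only a guess at what fits in less memory and would need tuning for your GPU.

th main.lua -scale 4 -nFeat 256 -nResBlock 36 -patchSize 48 -batchSize 8 -scaleRes 0.1 -skipBatch 3

Smaller patches reduce memory per sample, but each example then carries less spatial context, which is one reason the final performance can drop.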

@meroluo
Author

meroluo commented Feb 27, 2019

Thank you for your suggestion! In my training the scale is set to 4; I will try changing the batch size or patch size in opts.lua.
