
training slow down #12

Open
go2sea opened this issue Jan 29, 2019 · 6 comments


@go2sea

go2sea commented Jan 29, 2019

I'm running the training code on 2 GPUs, and I found that the training time increases by about 7 s every 1000 steps. I tried adding torch.cuda.empty_cache() every 1000 steps, but it doesn't help. Is there any solution for this?

Thanks.
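A minimal sketch of how the wall time per 1k-step block could be logged to check whether the slowdown is really cumulative. The training loop, model, loader, and L1 loss below are placeholders, not the repository's actual code:

```python
import time
import torch
import torch.nn.functional as F

def train(model, loader, optimizer, device, max_steps=50000, log_every=1000):
    """Log how long each block of `log_every` steps takes."""
    model.train()
    data_iter = iter(loader)
    block_start = time.time()
    for step in range(1, max_steps + 1):
        try:
            lr_img, hr_img = next(data_iter)
        except StopIteration:
            data_iter = iter(loader)          # restart the epoch
            lr_img, hr_img = next(data_iter)

        lr_img, hr_img = lr_img.to(device), hr_img.to(device)
        loss = F.l1_loss(model(lr_img), hr_img)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if step % log_every == 0:
            torch.cuda.synchronize()          # flush queued GPU work before timing
            elapsed = time.time() - block_start
            print(f"steps {step - log_every + 1}-{step}: {elapsed:.1f}s")
            block_start = time.time()
```

If the printed block times grow steadily, the slowdown is accumulating somewhere (data loading, a growing history kept on the GPU, etc.) rather than being one-off jitter.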

@nmhkahn
Owner

nmhkahn commented Jan 29, 2019

Hi.
Is the training time cumulatively increasing by 7 s every 1k steps?
I haven't plotted the wall-time graph, so I wasn't aware of this issue, and I'm not sure how to solve it.
Sorry.

@yu45020

yu45020 commented Jan 31, 2019

I am wondering whether it is caused by the dataloader. You may set pin_memory, split the data equally across workers, or rewrite TrainDataset: it opens the h5 file in __init__. Another way is to open the h5 file in __getitem__, which enables multi-process reading. It works well for sqlite, so I guess it will work for h5 as well.
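A hypothetical variant of TrainDataset that opens the HDF5 file lazily inside __getitem__, so each DataLoader worker process gets its own file handle. The dataset key names ("hr", "x2", ...) and tensor layout here are illustrative assumptions, not the repository's actual format:

```python
import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class LazyH5Dataset(Dataset):
    def __init__(self, path, scale=2):
        self.path = path
        self.scale = scale
        self.h5 = None                      # opened lazily, once per worker
        with h5py.File(path, "r") as f:
            self.length = len(f["hr"])      # assumes an "hr" dataset of shape (N, ...)

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.h5 is None:                 # first call in this worker process
            self.h5 = h5py.File(self.path, "r")
        hr = torch.from_numpy(self.h5["hr"][idx])
        lr = torch.from_numpy(self.h5["x{}".format(self.scale)][idx])
        return lr, hr

# pin_memory plus a few workers usually helps keep the GPUs fed
loader = DataLoader(LazyH5Dataset("train.h5"), batch_size=64,
                    shuffle=True, num_workers=4, pin_memory=True)
```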

@Xingrun-Xing

Hi, how long does 1000 steps take for you? My training is slow on 2 K80s: it takes about 20 hours for 30000+ steps. Is that normal? I think it's too slow.
ps: patch_size 64, batch_size 96

@feiyangha

Hi, which datasets did you use to train the model? Just DIV2K? In the paper, three datasets are used for training.
Thanks.

@nmhkahn
Owner

nmhkahn commented May 9, 2019

@feiyangha
Just DIV2K.
The reason for describing three datasets is that they have been widely used, but we chose to use DIV2K.

@idealboy

dataset.py loads all the data in *.h5 into memory, so you must make sure that memory is sufficient.

Also, your system may be slowed down by high page-cache occupancy, which can block training while it waits for memory allocation or swapping.

Use the 'htop' command to check your machine, and try this before training (as root):
sync; echo 3 > /proc/sys/vm/drop_caches

It will clear the PageCache, dentries, and inodes.

Hope it helps, good luck.
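A minimal monitoring sketch, assuming psutil is installed, that could be called every few hundred steps to see whether memory or page-cache pressure grows over the course of training:

```python
import psutil

def log_memory(step):
    """Print used/available RAM and page-cache size (Linux) at a given step."""
    vm = psutil.virtual_memory()
    cached = getattr(vm, "cached", 0)       # "cached" is only reported on Linux
    print(f"step {step}: used {vm.used / 1e9:.1f} GB, "
          f"available {vm.available / 1e9:.1f} GB, "
          f"page cache {cached / 1e9:.1f} GB")
```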
