
training slow down #12

Open
go2sea opened this issue Jan 29, 2019 · 6 comments


@go2sea

go2sea commented Jan 29, 2019

I'm running the training code on 2 GPUs, and I found that the training time increases by about 7 s every 1000 steps. I tried adding torch.cuda.empty_cache() every 1000 steps, but it doesn't help. Is there any solution for this?

Thanks.
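A minimal sketch of how the wall time per 1k-step block could be logged to check whether the slowdown is really cumulative. The training loop, model, loader, and L1 loss below are placeholders, not the repository's actual code:

```python
import time
import torch
import torch.nn.functional as F

def train(model, loader, optimizer, device, max_steps=50000, log_every=1000):
    """Log how long each block of `log_every` steps takes."""
    model.train()
    data_iter = iter(loader)
    block_start = time.time()
    for step in range(1, max_steps + 1):
        try:
            lr_img, hr_img = next(data_iter)
        except StopIteration:
            data_iter = iter(loader)          # restart the epoch
            lr_img, hr_img = next(data_iter)

        lr_img, hr_img = lr_img.to(device), hr_img.to(device)
        loss = F.l1_loss(model(lr_img), hr_img)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if step % log_every == 0:
            torch.cuda.synchronize()          # flush queued GPU work before timing
            elapsed = time.time() - block_start
            print(f"steps {step - log_every + 1}-{step}: {elapsed:.1f}s")
            block_start = time.time()
```

If the printed block times grow steadily, the slowdown is accumulating somewhere (data loading, a growing history kept on the GPU, etc.) rather than being one-off jitter.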

@nmhkahn
Owner

nmhkahn commented Jan 29, 2019

Hi.
Is the training time cumulatively increasing by 7 s every 1k steps?
I haven't plotted the wall-time graph, so I wasn't aware of this issue, and I'm not sure how to solve it.
Sorry.

@yu45020

yu45020 commented Jan 31, 2019

I am wondering whether it is caused by the dataloader. You may set pin_memory, split the data equally across workers, or rewrite TrainDataset: it opens the h5 file in __init__. Another way is to open the h5 file in __getitem__, which enables multi-process reading. It works well for sqlite, so I guess it will work for h5 as well.
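A hypothetical variant of TrainDataset that opens the HDF5 file lazily inside __getitem__, so each DataLoader worker process gets its own file handle. The dataset key names ("hr", "x2", ...) and tensor layout here are illustrative assumptions, not the repository's actual format:

```python
import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class LazyH5Dataset(Dataset):
    def __init__(self, path, scale=2):
        self.path = path
        self.scale = scale
        self.h5 = None                      # opened lazily, once per worker
        with h5py.File(path, "r") as f:
            self.length = len(f["hr"])      # assumes an "hr" dataset of shape (N, ...)

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.h5 is None:                 # first call in this worker process
            self.h5 = h5py.File(self.path, "r")
        hr = torch.from_numpy(self.h5["hr"][idx])
        lr = torch.from_numpy(self.h5["x{}".format(self.scale)][idx])
        return lr, hr

# pin_memory plus a few workers usually helps keep the GPUs fed
loader = DataLoader(LazyH5Dataset("train.h5"), batch_size=64,
                    shuffle=True, num_workers=4, pin_memory=True)
```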

@Xingrun-Xing

Hi, how long does 1000 steps take for you? My training is slow on 2 K80s: it takes about 20 hours for 30000+ steps. Is that normal? I think it's too slow.
ps: patch_size 64, batch_size 96

@feiyangha

Hi, which datasets did you use to train the model? Just DIV2K? In the paper, three datasets are used for training.
Thanks.

@nmhkahn
Owner

nmhkahn commented May 9, 2019

@feiyangha
Just DIV2K.
The reason for describing three datasets is that they have been widely used, but we chose to use DIV2K.

@idealboy

dataset.py loads all the data in *.h5 into memory, so you must make sure that memory is sufficient.

Also, your system may be slowed down by high page-cache occupancy, which can block training while it waits for memory allocation or swapping.

Use the 'htop' command to check your machine, and try this before training (as root):
sync; echo 3 > /proc/sys/vm/drop_caches

It will clear the PageCache, dentries, and inodes.

Hope it helps, good luck.
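A minimal monitoring sketch, assuming psutil is installed, that could be called every few hundred steps to see whether memory or page-cache pressure grows over the course of training:

```python
import psutil

def log_memory(step):
    """Print used/available RAM and page-cache size (Linux) at a given step."""
    vm = psutil.virtual_memory()
    cached = getattr(vm, "cached", 0)       # "cached" is only reported on Linux
    print(f"step {step}: used {vm.used / 1e9:.1f} GB, "
          f"available {vm.available / 1e9:.1f} GB, "
          f"page cache {cached / 1e9:.1f} GB")
```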
