Training slows down #12
Comments
Hi.
I am wondering whether it is caused by the dataloader. You could set pin_memory, split the data equally across workers, or rewrite TrainDataset: it currently opens the h5 file in __init__. Another way is to open the h5 file in __getitem__, which enables multi-process reads. This works well for SQLite, so I guess it will work for h5 too.
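To make the lazy-open pattern concrete, here is a minimal, self-contained sketch that defers opening the backing store until __getitem__, so each DataLoader worker process gets its own handle. It uses sqlite3 (stdlib) since the comment notes the pattern works for SQLite; the same structure applies to h5py.File. The class and table names are illustrative, not from the repo:

```python
# Sketch of the "open in __getitem__" pattern: each worker process opens
# its own connection on first access, so multiple workers can read
# concurrently. Shown with sqlite3; swap in h5py.File for HDF5 data.
import sqlite3

class LazyDataset:
    def __init__(self, db_path):
        self.db_path = db_path
        self.conn = None  # opened lazily, once per worker process
        # Only read metadata eagerly; close this connection right away.
        with sqlite3.connect(db_path) as conn:
            self.length = conn.execute(
                "SELECT COUNT(*) FROM samples"
            ).fetchone()[0]

    def __getitem__(self, idx):
        if self.conn is None:  # first access in this process
            self.conn = sqlite3.connect(self.db_path)
        row = self.conn.execute(
            "SELECT value FROM samples WHERE id = ?", (idx,)
        ).fetchone()
        return row[0]

    def __len__(self):
        return self.length
```

A torch.utils.data.Dataset only needs these two methods, so the same class can subclass it unchanged and be handed to a DataLoader with num_workers > 0.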
Hi, how long did 1000 steps take for you? My training is slow on 2 K80s: it takes about 20 hours for 30000+ steps. Is that normal? I think it's too slow.
Hi, what datasets did you use to train the model? Just DIV2K? In the paper, three datasets are used for training.
@feiyangha |
dataset.py loads all the data in *.h5 into memory, so you must make sure that memory is sufficient. Your system may also be disturbed by high cache occupancy, which can block training while it waits for memory allocation or swapping. Use the 'htop' command to check your machine, and before training try clearing the PageCache, dentries, and inodes. May it help you, good luck.
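For reference, the standard Linux way to clear the PageCache, dentries, and inodes is to write "3" to /proc/sys/vm/drop_caches after a sync. A hedged sketch (Linux-only, requires root, so the write is guarded):

```python
# Sketch: drop Linux PageCache, dentries, and inodes before training.
# Writing "3" to /proc/sys/vm/drop_caches clears all three; "1" would
# clear only the PageCache. Needs root, hence the permission guard.
import os
import subprocess

DROP_CACHES = "/proc/sys/vm/drop_caches"

def drop_caches():
    subprocess.run(["sync"], check=True)  # flush dirty pages to disk first
    if os.path.exists(DROP_CACHES) and os.access(DROP_CACHES, os.W_OK):
        with open(DROP_CACHES, "w") as f:
            f.write("3\n")
        return True
    return False  # not root (or not Linux); nothing dropped
```

The equivalent shell one-liner, run as root, is `sync; echo 3 > /proc/sys/vm/drop_caches`.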
I run the training code on 2 GPUs, and I found that the training time increases by about 7s every 1000 steps. I tried adding torch.cuda.empty_cache() every 1000 steps, but it doesn't help. Is there any solution for this?
Thanks.
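Not a definitive diagnosis, but step time that creeps up over training is often caused by keeping loss tensors that are still attached to the autograd graph (e.g. for logging), which torch.cuda.empty_cache() will not fix. A minimal sketch of the pitfall and the fix (function names here are illustrative):

```python
# Hedged sketch of a common PyTorch slowdown: appending the raw loss
# tensor keeps its whole autograd graph alive, so memory and per-step
# time grow; .item() detaches it to a plain Python float instead.
import torch

def log_loss_leaky(history, loss):
    history.append(loss)         # retains the computation graph
    return history

def log_loss_fixed(history, loss):
    history.append(loss.item())  # plain float; graph can be freed
    return history
```

The same applies to any tensor held across steps: call .detach() (or .item() for scalars) before storing it.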