Sacred makes PyTorch training slow down #877
Hey @GoingMyWay! Thank you for noticing this and bringing it up. I never noticed this in my own experiments, but I checked one of my old runs and the tendency is there: the time required for one training step increases over the course of training. It's just not as severe as in your example.
I looked into the code of the run and the observer and found two things related to stdout capturing that could be causing this:
Both of these points are related to the stdout capturing. To pinpoint the issue, you could run your benchmark once without an observer and once without the print statement in the loop (if that is feasible for you; the number of iterations is quite large). If my assumptions are correct, the training time should not increase when you don't print anything in the loop, but it should not depend much on the observer.

To support my first point, I plotted the time required to append a short string to another string for different string lengths:

```python
import time
import matplotlib.pyplot as plt

times = []
lengths = []
# Append 100 characters to strings of increasing length and time each append.
for l in range(1, 3_500_000 * 100, 100_000):
    s = '_' * l
    st = time.time()
    s = s + 'a' * 100
    times.append(time.time() - st)
    lengths.append(l)

plt.plot(lengths, times)
plt.xlabel('string length')
plt.ylabel('time to append 100 characters [s]')
plt.show()
```

This kind of matches your plot.
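As an aside, the usual way to avoid this kind of quadratic string growth in Python is to collect the captured chunks in a list and join them only when the full text is needed. The following is a generic sketch of that pattern (illustrative buffer classes, not Sacred's actual capture code):

```python
import time


class ConcatBuffer:
    """Builds the captured text by string concatenation (quadratic overall)."""

    def __init__(self):
        self.text = ''

    def write(self, chunk):
        # Each write copies the whole accumulated string into a new one.
        self.text = self.text + chunk


class ListBuffer:
    """Collects chunks in a list and joins them only when needed (linear overall)."""

    def __init__(self):
        self.chunks = []

    def write(self, chunk):
        self.chunks.append(chunk)

    def getvalue(self):
        return ''.join(self.chunks)


for buf in (ConcatBuffer(), ListBuffer()):
    start = time.time()
    for _ in range(20_000):
        buf.write('x' * 100)  # simulate one print statement per training step
    print(f'{type(buf).__name__}: {time.time() - start:.2f}s')
```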
Hi @thequilo, thanks for pointing out the cause. Will you fix this issue in the next release? I would like to fix it myself, but I am not an expert on Sacred.
If I figure out a way to fix this, then yes. But I'm not sure yet how to do it without potentially breaking things (like custom observers). As a workaround, I would suggest avoiding frequent print or logging statements.
@thequilo, thanks. I'll reduce printing and logging.
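For reference, a minimal sketch of the suggested workaround is to throttle how often the training loop prints, e.g. only every N steps (the interval and loop below are arbitrary illustration values, not taken from the benchmark in this issue):

```python
import time

LOG_EVERY = 100  # arbitrary choice; larger values mean less captured output

start = time.time()
for step in range(1, 10_001):
    # ... training step would go here ...
    if step % LOG_EVERY == 0:
        print(f'step {step}: {(time.time() - start) / step:.4f}s per step on average')
```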
I have been using Sacred for a while and found an issue: when training with PyTorch on GPUs, training slows down over time.
OS: Ubuntu; Python: 3.9; tested GPUs: A100 and 3090; PyTorch: 1.9 or newer; Sacred: the latest release.
Reproducible code (no training):
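A minimal benchmark of this kind, with a placeholder experiment name, observer, and iteration count (not the exact snippet from this report), looks roughly like this:

```python
import time

from sacred import Experiment
from sacred.observers import FileStorageObserver

ex = Experiment('timing_benchmark')               # placeholder experiment name
ex.observers.append(FileStorageObserver('runs'))  # placeholder observer


@ex.main
def run():
    prev = time.time()
    for step in range(1, 200_001):                # arbitrary count, no real training
        print(f'step {step}')                     # frequent printing, as in a training loop
        if step % 1000 == 0:
            now = time.time()
            print(f'steps {step - 999}-{step}: {now - prev:.3f}s')
            prev = now


if __name__ == '__main__':
    ex.run_commandline()
```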
Without Sacred, the per-step time is stable:
With Sacred, the per-step time increases over time (two figures):
During actual training with Sacred, the issue is even more severe. The following is an example:
As you can see, the time cost per update keeps increasing, causing the ETA to grow out of control.