Initialize iterator only once per fit call #835
Conversation
At the moment, we initialize the iterators for the training and validation datasets once per epoch. However, this is unnecessary and creates an (ever so small) overhead. With this PR, the iterators are created once per fit call only.

Theoretically, this could present a backwards compatibility issue, but I think it will only affect very few, if any, users; those who are affected should be able to fix the issue quickly. I don't believe that anyone would rely on the iterators being initialized once per epoch as a feature. If there is a good use case for actually doing so, please tell me and we can discuss whether this PR should actually be rejected.

To test the change in terms of performance, I ran the MNIST benchmark. (While doing so, I fixed a few issues with the script.) The difference of this PR on this benchmark is not noticeable. I would only expect a performance difference on very small datasets. Still, I believe the benefits outweigh the costs.

Side note: test_pickle_load failed for me locally when cuda_available was set to False. I'm not exactly sure what the reason is; it could be that the way we patch torch.cuda.is_available breaks with some recent changes in PyTorch (tested on 1.8.1).
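In simplified form, the change amounts to the following. The function names and signatures here are stand-ins for illustration, not skorch's actual API, and a plain list stands in for a DataLoader:

```python
# Simplified sketch of the change; the real skorch methods take more
# arguments, and "iterator" stands in for a torch DataLoader.

def fit_loop_old(dataset, epochs, get_iterator):
    """Before: a fresh iterator is created at the start of every epoch."""
    for _ in range(epochs):
        iterator = get_iterator(dataset)  # once per epoch
        run_single_epoch(iterator)

def fit_loop_new(dataset, epochs, get_iterator):
    """After: the iterator is created once per fit call and reused."""
    iterator = get_iterator(dataset)  # once per fit call
    for _ in range(epochs):
        run_single_epoch(iterator)

def run_single_epoch(iterator):
    for batch in iterator:
        pass  # training step would go here
```

Since a DataLoader is re-iterable (each `for` loop over it starts fresh), one instance can serve every epoch.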
Looking at code around GitHub, there is some code that subclasses NeuralNet and overrides run_single_epoch. Passing a different object into run_single_epoch would be BC breaking for such code. If one writes a custom fit_loop and passes the dataset into run_single_epoch, that would break as well.

Looking through the DataLoader source, I do not see any costly operations. I agree it would be nice to initialize the loader once, but there is a significant BC cost.
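The kind of override that would break can be sketched like this. The classes and method bodies below are hypothetical stand-ins (skorch's real signatures take more arguments); a plain sequence plays the role of the dataset:

```python
# Hypothetical stand-ins to illustrate the BC concern.

class NeuralNet:
    def get_iterator(self, dataset, training=False):
        # batch the raw dataset (a plain sequence here) into pairs
        return [dataset[i:i + 2] for i in range(0, len(dataset), 2)]

    def run_single_epoch(self, dataset, training=True):
        for batch in self.get_iterator(dataset, training=training):
            pass  # training step would go here


class MyNet(NeuralNet):
    # An override in the style found on GitHub: it assumes it receives
    # the raw dataset, calls len() on it, and builds the loader itself.
    # If fit_loop started passing an already-initialized loader instead,
    # len(dataset) would count batches rather than samples, and
    # get_iterator would wrap the loader a second time.
    def run_single_epoch(self, dataset, training=True):
        n_samples = len(dataset)
        super().run_single_epoch(dataset, training=training)
        return n_samples
```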
Thanks for finding these examples @thomasjpfan. I didn't think about searching GitHub for instances (was that always possible?). I'm thinking about possible mitigations for the BC issue.
Any other proposals for how to help users transition?
One more reason to initialize the loader only once is that it corresponds more closely to the "vanilla PyTorch fit loop" one can find in many tutorials (e.g. here). But yeah, the performance gain is minimal.
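That tutorial-style loop creates the loader once, outside the epoch loop. The stub class below mimics the re-iterability of torch.utils.data.DataLoader so the sketch is self-contained (it is not the real class):

```python
class DataLoader:
    """Minimal stand-in mimicking torch.utils.data.DataLoader's
    re-iterability: each new ``for`` loop over it starts from scratch."""
    def __init__(self, dataset, batch_size):
        self.dataset, self.batch_size = dataset, batch_size

    def __iter__(self):
        for i in range(0, len(self.dataset), self.batch_size):
            yield self.dataset[i:i + self.batch_size]


dataset = list(range(10))
loader = DataLoader(dataset, batch_size=4)  # created once, outside the loop

seen = []
for epoch in range(3):       # iterating the loader restarts it,
    for batch in loader:     # so one instance serves every epoch
        seen.append(batch)
```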
For a custom fit_loop or a custom run_single_epoch, technically I think it's possible to migrate, but I do not know if it is worth it.
I tried what would happen and the error I got was a TypeError, which is unsurprising. So the way to catch it would be to wrap the relevant call (lines 1087 to 1088 in c58ae67) in try ... except TypeError and inform the user about the mitigation.
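A sketch of that wrapping; the helper name, the warning text, and the fallback are made up for illustration (the real torch DataLoader does expose its dataset via the .dataset attribute):

```python
import warnings

def call_run_single_epoch(run_single_epoch, iterator):
    """Call a possibly user-overridden run_single_epoch with the new-style
    iterator; if an old-style override chokes with a TypeError, warn and
    fall back to passing the underlying dataset."""
    try:
        return run_single_epoch(iterator)
    except TypeError:
        warnings.warn(
            "run_single_epoch now receives an initialized iterator instead "
            "of a dataset; please update your override accordingly.",
            DeprecationWarning,
        )
        # fall back to the old behavior: pass the raw dataset
        return run_single_epoch(iterator.dataset)
```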
Alternatively, we could type check the input to get_iterator:

```python
def get_iterator(self, dataset, training=False):
    if isinstance(dataset, DataLoader):
        warnings.warn("helpful message", DeprecationWarning)
        return dataset
    # old code below
```
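Filled out with a stand-in DataLoader class (hypothetical, for illustration; the warning text and batch size are made up), the check behaves like this:

```python
import warnings

class DataLoader:  # minimal stand-in for torch.utils.data.DataLoader
    def __init__(self, dataset, batch_size=2):
        self.dataset, self.batch_size = dataset, batch_size

    def __iter__(self):
        for i in range(0, len(self.dataset), self.batch_size):
            yield self.dataset[i:i + self.batch_size]


def get_iterator(dataset, training=False):
    if isinstance(dataset, DataLoader):
        # deprecated path: the caller already created the loader
        warnings.warn(
            "Passing an initialized DataLoader is deprecated; "
            "pass the dataset instead.",
            DeprecationWarning,
        )
        return dataset
    # "old code below": wrap the raw dataset in a fresh loader
    return DataLoader(dataset, batch_size=2)
```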
The reason why I think it could be worth it is that someone who ports their vanilla PyTorch code to skorch should have as few gotchas as possible. Instantiating the DataLoader once per epoch instead of once per fit call would be one such gotcha that we could avoid. AFAICT, there is no technical reason why we need to instantiate it once per epoch; the linked examples don't make use of that functionality.
Let's go with the wrapping approach.
Users who override run_single_epoch will not get an error; instead, we give a DeprecationWarning, including an instruction on how to migrate.
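The migration presumably boils down to accepting the already-initialized iterator instead of the raw dataset. A guess at its shape with simplified stand-in classes, not skorch's actual documented instructions:

```python
# Before: the override expects a raw dataset and builds batches itself.
class OldStyleNet:
    def get_iterator(self, dataset):
        return [dataset[i:i + 2] for i in range(0, len(dataset), 2)]

    def run_single_epoch(self, dataset):
        return sum(len(batch) for batch in self.get_iterator(dataset))


# After: the override expects an already-initialized iterator and no
# longer calls get_iterator itself.
class NewStyleNet:
    def run_single_epoch(self, iterator):
        return sum(len(batch) for batch in iterator)
```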
@thomasjpfan I pushed a change to make this PR backward compatible and added instructions on how to migrate.
LGTM
@thomasjpfan I added a bit more documentation after your approval; I hope it's fine that I still merged. After discussing with @ottonemo, he also mentioned that re-initializing the
In #835, we made a change to the data loader initialization. To keep backwards compatibility, we had to add extra code to get_iterator, together with a deprecation notice. The deprecation period is over, so that code is now removed, as well as the associated test.