
Initialize iterator only once per fit call #835

Merged
merged 4 commits into master from initialize-iterator-only-once-per-fit-call on Feb 13, 2022

Conversation

BenjaminBossan
Collaborator

Description

At the moment, we initialize the iterators for the training and
validation datasets once per epoch. However, this is unnecessary and
creates an (ever so small) overhead.

With this PR, the iterators are created once per fit call only.
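
Roughly, the change can be illustrated like this (a condensed sketch, not the actual fit_loop implementation; get_iterator, train_step, and max_epochs follow the skorch API, but the loop structure is simplified):

    # Before: a new iterator is created at the start of every epoch.
    for epoch in range(self.max_epochs):
        for batch in self.get_iterator(dataset_train, training=True):
            self.train_step(batch)

    # After: the iterator is created once per fit call and re-iterated each epoch.
    iterator_train = self.get_iterator(dataset_train, training=True)
    for epoch in range(self.max_epochs):
        for batch in iterator_train:
            self.train_step(batch)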

Implementation

Theoretically, this could present a backwards compatibility issue, but I
think it will only affect very few, if any, users; those who are
affected should be able to fix the issue quickly.

I don't believe that anyone would rely on the iterators being
initialized once per epoch as a feature. If there is a good use case for
actually doing so, please tell me and we can discuss whether this PR
should be rejected.

To test the change in terms of performance, I ran the MNIST benchmark.
(While doing so, I fixed a few issues with the script.) The difference
this PR makes on this benchmark is not noticeable. I would only expect a
performance difference on very small datasets. Still, I believe the
benefits outweigh the costs.

Side note

test_pickle_load failed for me locally when cuda_available was set
to False. I'm not exactly sure what the reason is; it could be that the
way we patch torch.cuda.is_available breaks with some recent changes in
PyTorch (tested on 1.8.1).

@thomasjpfan
Member

Looking at code around GitHub, there is some code that subclasses NeuralNet and overrides run_single_epoch. For example:

https://github.com/shivar2/MotorImagery-Classification/blob/deaf89cb77b02b3033595e21bb54b0607e4b4d1a/Code/Classifier/EEGTLClassifier.py#L28

https://github.com/kqf/coin-detector/blob/29c26bd34489a11a381e995de13c5735cb1e6d4a/detectors/detnet.py#L35-L37

Passing a different object into run_single_epoch would be BC breaking for the code above. If one writes a custom fit_loop and passes the dataset into run_single_epoch, that would break as well.

Looking through the DataLoader source I do not see any costly operations. I agree it would be nice to initialize the loader once, but there is a significant BC cost.

@BenjaminBossan
Collaborator Author

Thanks for finding these examples @thomasjpfan. I didn't think about searching GitHub for instances (was that always possible?).

I'm thinking about possible mitigations to the BC issue.

  1. We could catch the error and tell users to replace for batch in self.get_iterator(dataset, training=training): with for batch in iterator: (see the sketch after this list).
  2. We could catch the error, raise a warning, and then call run_single_epoch with the dataset instead (I don't like this).
  3. Add a metaclass that checks if run_single_epoch is being overridden and then ... okay, this is getting ridiculous ;)
  4. On top of that, we can add a section to the docs on how to change the code when migrating to skorch 0.12.
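
To make option 1 concrete, the migration for a custom run_single_epoch could look roughly like this (a sketch only; the parameter name iterator and the method body are illustrative, not the exact skorch signature):

    class MyNet(NeuralNet):
        # Previously, run_single_epoch received the dataset and created the
        # iterator itself:
        #
        #     for batch in self.get_iterator(dataset, training=training):
        #         ...
        #
        # With this PR, it receives an already-initialized iterator, so the
        # get_iterator call is simply dropped:
        def run_single_epoch(self, iterator, training, prefix, step_fn, **fit_params):
            for batch in iterator:
                step_fn(batch, **fit_params)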

Any other proposals for how to help users transition?

Looking through the DataLoader source I do not see any costly operations. I agree it would be nice to initialize the loader once, but there is a significant BC cost.

One more reason to initialize the loader only once is that it corresponds more closely to the "vanilla PyTorch fit loop" one can find in many tutorials (e.g. here). But yeah, the performance gain is minimal.
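
For reference, that tutorial pattern looks roughly like this (all names here are generic placeholders):

    from torch.utils.data import DataLoader

    loader = DataLoader(train_dataset, batch_size=64, shuffle=True)  # created once

    for epoch in range(num_epochs):
        for X, y in loader:  # only iterated, not re-created, inside the epoch loop
            optimizer.zero_grad()
            loss = criterion(model(X), y)
            loss.backward()
            optimizer.step()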

@thomasjpfan
Member

We could catch the error and tell users to replace for batch in self.get_iterator

For a custom run_single_epoch, I do not see a good place to catch the error. We can do it in fit_loop but we can not be sure that run_single_epoch will error when passed a dataset. We can assume a developer will use get_iterator in their run_single_epoch implementation and catch it there.

For a custom fit_loop that does not override run_single_epoch, we can catch it in our implementation of run_single_epoch when it is passed a dataset. If we want to be extra safe, then we do a deprecation cycle.

Technically I think it's possible to migrate, but I do not know if it is worth it.

@BenjaminBossan
Collaborator Author

For a custom run_single_epoch, I do not see a good place to catch the error. We can do it in fit_loop but we can not be sure that run_single_epoch will error when passed a dataset. We can assume a developer will use get_iterator in their run_single_epoch implementation and catch it there.

I tried what would happen and the error I got was:

    TypeError: 'DataLoader' object is not subscriptable

which is unsurprising. So the way to catch it would be to wrap

skorch/skorch/net.py, lines 1087 to 1088 in c58ae67:

    self.run_single_epoch(dataset_train, training=True, prefix="train",
                          step_fn=self.train_step, **fit_params)

inside a try ... except TypeError and inform the user about the mitigation.
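
Something along these lines (a rough sketch only; the argument names and the message wording are placeholders, not the code that was eventually merged):

    try:
        self.run_single_epoch(iterator_train, training=True, prefix="train",
                              step_fn=self.train_step, **fit_params)
    except TypeError as exc:
        raise TypeError(
            "run_single_epoch now receives an iterator instead of a dataset; "
            "in a custom run_single_epoch, replace "
            "'for batch in self.get_iterator(dataset, training=training):' "
            "with 'for batch in iterator:'"
        ) from exc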

Alternatively, we could type check the input to get_iterator. If it's already a DataLoader, we could give a DeprecationWarning with an instruction about the mitigation and then immediately return the "dataset" (which really is a DataLoader):

    def get_iterator(self, dataset, training=False):
        if isinstance(dataset, DataLoader):
            warnings.warn("helpful message", DeprecationWarning)
            return dataset
        # old code below

Technically I think it's possible to migrate, but I do not know if it is worth it.

The reason why I think it could be worth it is that someone who ports their vanilla PyTorch code to skorch should have as few gotchas as possible. Instantiating the DataLoader once per epoch instead of once per fit call would be one such gotcha that we could avoid. AFAICT, there is no technical reason why we need to instantiate it once per epoch; the linked examples don't make use of that functionality.

@thomasjpfan
Member

Let's go with wrapping get_iterator, going through deprecation cycle, and linking to a migration guide. The migration itself is simple enough.

Users who override run_single_epoch will not get an error but we give a
DeprecationWarning, including an instruction on how to migrate.
@BenjaminBossan
Collaborator Author

@thomasjpfan I pushed a change to make this PR backward compatible and added instructions on how to migrate.

@thomasjpfan
Member

LGTM

@BenjaminBossan
Collaborator Author

@thomasjpfan I added a bit more documentation after your approval; I hope it's fine that I still merged.

I also discussed this with @ottonemo; he mentioned that re-initializing the DataLoader might not be optimal when used with multi-processing, so this could be another argument in favor of this switch.
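
As a rough illustration of the multi-processing point (parameter values are arbitrary): worker processes belong to the DataLoader, so creating the loader once per fit call is what lets an option like persistent_workers keep workers alive across epochs, whereas a loader that is re-created every epoch cannot reuse them.

    from torch.utils.data import DataLoader

    loader = DataLoader(
        train_dataset,            # placeholder dataset
        batch_size=64,
        num_workers=4,            # spawn worker processes for loading
        persistent_workers=True,  # keep workers alive between epochs
    )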

@BenjaminBossan BenjaminBossan merged commit f5bb1a5 into master Feb 13, 2022
@BenjaminBossan BenjaminBossan deleted the initialize-iterator-only-once-per-fit-call branch July 22, 2022 12:43
BenjaminBossan added a commit that referenced this pull request May 8, 2023
In #835, we made a change to the data loader initialization. To keep
backwards compatibility, we had to add extra code to get_iterator,
together with a deprecation notice. The deprecation period is over, so
that code is removed, as well as the associated test.
@BenjaminBossan BenjaminBossan mentioned this pull request May 8, 2023