
Initialize iterator only once per fit call #835

Merged
merged 4 commits into master from initialize-iterator-only-once-per-fit-call on Feb 13, 2022

Conversation

BenjaminBossan
Collaborator

Description

At the moment, we initialize the iterators for the training and
validation datasets once per epoch. However, this is unnecessary and
creates an (ever so small) overhead.

With this PR, the iterators are created once per fit call only.
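
Roughly, the change can be illustrated like this (a condensed sketch, not the actual fit_loop implementation; get_iterator, train_step, and max_epochs follow the skorch API, but the loop structure is simplified):

    # Before: a new iterator is created at the start of every epoch.
    for epoch in range(self.max_epochs):
        for batch in self.get_iterator(dataset_train, training=True):
            self.train_step(batch)

    # After: the iterator is created once per fit call and re-iterated each epoch.
    iterator_train = self.get_iterator(dataset_train, training=True)
    for epoch in range(self.max_epochs):
        for batch in iterator_train:
            self.train_step(batch)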

Implementation

Theoretically, this could present a backwards compatibility issue, but I
think it will only affect very few, if any, users; those who are
affected should be able to fix the issue quickly.

I don't believe that anyone would rely on the iterators being
initialized once per epoch as a feature. If there is a good use case for
actually doing so, please tell me and we can discuss whether this PR
should be rejected.

To test the change in terms of performance, I ran the MNIST benchmark.
(While doing so, I fixed a few issues with the script.) The difference
this PR makes on this benchmark is not noticeable. I would only expect a
performance difference on very small datasets. Still, I believe the
benefits outweigh the costs.

Side note

test_pickle_load failed for me locally when cuda_available was set
to False. I'm not exactly sure what the reason is; it could be that the
way we patch torch.cuda.is_available breaks with some recent changes in
PyTorch (tested on 1.8.1).

@thomasjpfan
Member

Looking at code around GitHub, there is some code that subclasses NeuralNet and overrides run_single_epoch. For example:

https://github.com/shivar2/MotorImagery-Classification/blob/deaf89cb77b02b3033595e21bb54b0607e4b4d1a/Code/Classifier/EEGTLClassifier.py#L28

https://github.com/kqf/coin-detector/blob/29c26bd34489a11a381e995de13c5735cb1e6d4a/detectors/detnet.py#L35-L37

Passing a different object into run_single_epoch would be BC breaking for the code above. If one writes a custom fit_loop and passes the dataset into run_single_epoch, that would break as well.

Looking through the DataLoader source I do not see any costly operations. I agree it would be nice to initialize the loader once, but there is a significant BC cost.

@BenjaminBossan
Collaborator Author

Thanks for finding these examples @thomasjpfan. I didn't think about searching GitHub for instances (was that always possible?).

I'm thinking about possible mitigations to the BC issue.

  1. We could catch the error and tell users to replace for batch in self.get_iterator(dataset, training=training): with for batch in iterator: (see the sketch after this list).
  2. We could catch the error, raise a warning, and then call run_single_epoch with the dataset instead (I don't like this).
  3. Add a metaclass that checks if run_single_epoch is being overridden and then ... okay, this is getting ridiculous ;)
  4. On top of that, we can add a section to the docs on how to change the code when migrating to skorch 0.12.
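
To make option 1 concrete, the migration for a custom run_single_epoch could look roughly like this (a sketch only; the parameter name iterator and the method body are illustrative, not the exact skorch signature):

    class MyNet(NeuralNet):
        # Previously, run_single_epoch received the dataset and created the
        # iterator itself:
        #
        #     for batch in self.get_iterator(dataset, training=training):
        #         ...
        #
        # With this PR, it receives an already-initialized iterator, so the
        # get_iterator call is simply dropped:
        def run_single_epoch(self, iterator, training, prefix, step_fn, **fit_params):
            for batch in iterator:
                step_fn(batch, **fit_params)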

Any other proposals for how to help users transition?

Looking through the DataLoader source I do not see any costly operations. I agree it would be nice to initialize the loader once, but there is a significant BC cost.

One more reason to initialize the loader only once is that it corresponds more closely to the "vanilla PyTorch fit loop" one can find in many tutorials (e.g. here). But yeah, the performance gain is minimal.
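
For reference, that tutorial pattern looks roughly like this (all names here are generic placeholders):

    from torch.utils.data import DataLoader

    loader = DataLoader(train_dataset, batch_size=64, shuffle=True)  # created once

    for epoch in range(num_epochs):
        for X, y in loader:  # only iterated, not re-created, inside the epoch loop
            optimizer.zero_grad()
            loss = criterion(model(X), y)
            loss.backward()
            optimizer.step()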

@thomasjpfan
Member

We could catch the error and tell users to replace for batch in self.get_iterator

For a custom run_single_epoch, I do not see a good place to catch the error. We can do it in fit_loop but we can not be sure that run_single_epoch will error when passed a dataset. We can assume a developer will use get_iterator in their run_single_epoch implementation and catch it there.

For a custom fit_loop that does not override run_single_epoch, we can catch it in our implementation of run_single_epoch when it is passed a dataset. If we want to be extra safe, then we do a deprecation cycle.

Technically I think it's possible to migrate, but I do not know if it is worth it.

@BenjaminBossan
Collaborator Author

For a custom run_single_epoch, I do not see a good place to catch the error. We can do it in fit_loop but we can not be sure that run_single_epoch will error when passed a dataset. We can assume a developer will use get_iterator in their run_single_epoch implementation and catch it there.

I tried what would happen and the error I got was:

    TypeError: 'DataLoader' object is not subscriptable

which is unsurprising. So the way to catch it would be to wrap

skorch/skorch/net.py, lines 1087 to 1088 in c58ae67:

    self.run_single_epoch(dataset_train, training=True, prefix="train",
                          step_fn=self.train_step, **fit_params)

inside a try ... except TypeError and inform the user about the mitigation.
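
Something along these lines (a rough sketch only; the argument names and the message wording are placeholders, not the code that was eventually merged):

    try:
        self.run_single_epoch(iterator_train, training=True, prefix="train",
                              step_fn=self.train_step, **fit_params)
    except TypeError as exc:
        raise TypeError(
            "run_single_epoch now receives an iterator instead of a dataset; "
            "in a custom run_single_epoch, replace "
            "'for batch in self.get_iterator(dataset, training=training):' "
            "with 'for batch in iterator:'"
        ) from exc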

Alternatively, we could type check the input to get_iterator. If it's already a DataLoader, we could give a DeprecationWarning with an instruction about the mitigation and then immediately return the "dataset" (which really is a DataLoader):

    def get_iterator(self, dataset, training=False):
        if isinstance(dataset, DataLoader):
            warnings.warn("helpful message", DeprecationWarning)
            return dataset
        # old code below

Technically I think it's possible to migrate, but I do not know if it is worth it.

The reason why I think it could be worth it is that someone who ports their vanilla PyTorch code to skorch should have as few gotchas as possible. Instantiating the DataLoader once per epoch instead of once per fit call would be one such gotcha that we could avoid. AFAICT, there is no technical reason why we need to instantiate it once per epoch; the linked examples don't make use of that functionality.

@thomasjpfan
Member

Let's go with wrapping get_iterator, going through deprecation cycle, and linking to a migration guide. The migration itself is simple enough.

Users who override run_single_epoch will not get an error but we give a
DeprecationWarning, including an instruction on how to migrate.
@BenjaminBossan
Collaborator Author

@thomasjpfan I pushed a change to make this PR backward compatible and added instructions on how to migrate.

@thomasjpfan
Member

LGTM

@BenjaminBossan
Collaborator Author

@thomasjpfan I added a bit more documentation after your approval; I hope it's fine that I still merged.

I also discussed this with @ottonemo; he mentioned that re-initializing the DataLoader might not be optimal when used with multi-processing, so this could be another argument in favor of this switch.
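
As a rough illustration of the multi-processing point (parameter values are arbitrary): worker processes belong to the DataLoader, so creating the loader once per fit call is what lets an option like persistent_workers keep workers alive across epochs, whereas a loader that is re-created every epoch cannot reuse them.

    from torch.utils.data import DataLoader

    loader = DataLoader(
        train_dataset,            # placeholder dataset
        batch_size=64,
        num_workers=4,            # spawn worker processes for loading
        persistent_workers=True,  # keep workers alive between epochs
    )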

@BenjaminBossan BenjaminBossan merged commit f5bb1a5 into master Feb 13, 2022
@BenjaminBossan BenjaminBossan deleted the initialize-iterator-only-once-per-fit-call branch July 22, 2022 12:43
BenjaminBossan added a commit that referenced this pull request May 8, 2023
In #835, we made a change to the data loader initialization. To keep
backwards compatibility, we had to add extra code to get_iterator,
together with a deprecation notice. The deprecation period is over, so
that code is removed, as well as the associated test.
@BenjaminBossan BenjaminBossan mentioned this pull request May 8, 2023