
Unable to access dataloaders within configure_optimizers #10430

Closed
rohitgr7 opened this issue Nov 9, 2021 · 11 comments
Labels: data handling, refactor

Comments

rohitgr7 (Contributor) commented Nov 9, 2021

Proposed refactoring or deprecation

Before v1.5, the dataloader hooks were patched onto the model and were easily accessible within configure_optimizers, e.g. to set up the total number of training steps for a scheduler. Now that they are no longer patched,
https://github.com/PyTorchLightning/pytorch-lightning/blob/0ed5e3dc8abcec40aacd64cc9175590bb1409759/pytorch_lightning/trainer/connectors/data_connector.py#L213-L224
they are no longer available directly when using a datamodule or when dataloaders are passed directly to .fit. Nor can they be accessed via self.trainer.train_dataloaders, because the dataloaders are loaded within the fit loop:
https://github.com/PyTorchLightning/pytorch-lightning/blob/0ed5e3dc8abcec40aacd64cc9175590bb1409759/pytorch_lightning/loops/fit_loop.py#L194-L197
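
For context, a minimal sketch of the kind of pattern this affects; the optimizer and scheduler choices here are illustrative, and torch is assumed to be imported at module level:

def configure_optimizers(self):
    # before v1.5 this worked even when the dataloader was passed to .fit(),
    # because the trainer patched train_dataloader() onto the model
    train_loader = self.train_dataloader()
    total_steps = len(train_loader) * self.trainer.max_epochs

    optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, total_iters=total_steps)
    return [optimizer], [{"scheduler": scheduler, "interval": "step"}]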

Motivation

I'd suggest these dataloaders should be available to users, no matter how they are passed to .fit.

Pitch

If possible, we should call configure_optimizers after the dataloaders are loaded for the first time within the fit loop. Not sure whether this would bring complications and failures, since we load the optimizers differently for DeepSpeed.

Additional context

As always, alternative suggestions/thoughts would be appreciated :)

cc: @karthikrangasai @awaelchli


If you enjoy Lightning, check out our other projects! ⚡

  • Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

  • Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.

  • Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.

  • Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.

  • Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.

rohitgr7 added the refactor and data handling labels on Nov 9, 2021
awaelchli (Contributor) commented:

@rohitgr7 Good observation.

Yes, this was indeed a somewhat breaking change and it was not fully intentional. However, one could also argue this was a bug, or at least undefined behavior. Previously, if we called train_dataloader(), e.g. through the trainer, it would still have led to the Trainer calling it a second time later on when it got reset, if I recall correctly.

For the purpose of documentation, here is the sequence of hooks we execute around the configure_optimizers and *_dataloader calls:

https://github.com/PyTorchLightning/pytorch-lightning/blob/7fb277f260bf91137af1ac599ed2751bbf1d49e3/tests/models/test_hooks.py#L648-L663

With this request, either the configure_optimizers call would have to move several hooks later, or the dataloader reset call would have to move earlier.

What would make this configuration more favorable than the current one? Isn't it equally valid to request having the optimizers available in the dataloader methods? If both need to be the case, then this is hitting the fundamental limits of the hook system in Lightning. A similar issue was faced in #8485.

All I'm saying is, if the hook order changes then we need to be 100% sure that the order we choose is more favorable than the other in most cases, and we need to be aware of the limitations.

romovpa commented Nov 10, 2021

Hi! @rohitgr7 @awaelchli
I have a use case where I need to access the training dataloader in configure_optimizers(), not vice versa.

I'm working on integrating Lightning with Opacus (it enables training with differential privacy). This pull request demonstrates how a PL user can add privacy by just wrapping the optimizer in configure_optimizers().

The problem is, we need access to the training data to configure noise generation and privacy accounting properly. We could pass some statistics about the training data as model parameters, but this can easily lead to an inconsistent optimizer configuration and, as a result, incorrect privacy accounting.

My current solution touches _data_connector, which is a dirty hack:

    def configure_optimizers(self):
        optimizer = optim.SGD(self.parameters(), lr=self.lr, momentum=0)

        if self.enable_dp:
            # dirty introspection of the trainer instance to get the training data
            data_loader = (
                self.trainer._data_connector._train_dataloader_source.dataloader()
            )
            # transform (model, optimizer, dataloader) to DP-versions
            model, optimizer, dataloader = self.privacy_engine.make_private(
                self,
                optimizer,
                data_loader,
                ...
            )

        return optimizer

cc @ananthsub (To our discussion about using TrainingTypePlugin: I think this way of enabling DP by modifying configure_optimizers can be useful for advanced users and DP researchers, I would prefer to have both on the table)

rohitgr7 (Contributor, Author) commented:

With this request, either the configure_optimizers call would have to move several hooks later, or the dataloader reset call would have to move earlier.

I think moving configure_optimizers later would be better, since moving the dataloader initialization call outside the FitLoop is inconvenient; plus, we only initialize optimizers during .fit, so moving this initialization inside the fit loop might be better. Although that will require a lot of changes around sharded/DeepSpeed, since they do their initialization and wrapping in a different way. cc @SeanNaren

Isn't it equally valid to request having the optimizers available in the dataloader methods?

Do we have any use-case/example for it? @awaelchli

awaelchli (Contributor) commented:

@rohitgr7 Would you like to try to invert the order in a draft? Failing tests should then reveal potential challenges we have not considered yet, if there are any.

rohitgr7 (Contributor, Author) commented:

Yep! We do have some failing tests.

andreimargeloiu commented:

Any update?

rohitgr7 (Contributor, Author) commented Feb 7, 2022

Hey @andreimargeloiu!

The current workaround is to initialize the dataloader yourself, before Lightning does, using:

def configure_optimizers(self):
    self.trainer.reset_train_dataloader()
    self.trainer.train_dataloader.loaders  # access the underlying loader(s) here

2533245542 commented:

A bit confused here. So in the current PyTorch Lightning version, if I want to access the train dataloader, for example in my model's forward method, I do train_loader = self.trainer.datamodule.train_dataloader()?

It looks like if I do this in the model's def setup(self), the train dataloader will be loaded because it was not loaded before. But if I do this in the model's forward method, will the dataloader be reloaded, or will it just use a cached one?

Would be happy to see if there is a more elegant way of doing this.

LarsHill commented:

The above workaround by @rohitgr7 does not seem to work anymore in pytorch-lightning >= 2.0.0, since the reset_train_dataloader() function no longer exists.
I receive the error AttributeError: 'Trainer' object has no attribute 'reset_train_dataloader'.

Is there another workaround, or any plans to fix this? I think it is quite common to access the dataset in configure_optimizers in order to easily get access to data loading parameters like the batch size, the total number of samples, etc. This information is often necessary for initializing learning rate schedulers.

carmocca (Contributor) commented:

In 2.0, you can do

def configure_optimizers(self):
    self.trainer.fit_loop.setup_data()
    dataloader = self.trainer.train_dataloader
        
    ...

You could also call self.train_dataloader() if your LightningModule defines it.
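
For completeness, a minimal sketch of the original motivating use case on top of this snippet, sizing a scheduler by the total number of training steps; the optimizer, scheduler, and single map-style dataloader are illustrative assumptions, and torch is assumed to be imported at module level:

def configure_optimizers(self):
    # have the trainer build the train dataloader if it hasn't already (PL 2.0)
    self.trainer.fit_loop.setup_data()
    dataloader = self.trainer.train_dataloader  # assumes a single map-style DataLoader

    # total optimizer steps = batches per epoch * epochs (ignoring gradient accumulation)
    total_steps = len(dataloader) * self.trainer.max_epochs

    optimizer = torch.optim.AdamW(self.parameters(), lr=3e-4)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=3e-4, total_steps=total_steps)
    return {"optimizer": optimizer, "lr_scheduler": {"scheduler": scheduler, "interval": "step"}}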

carmocca (Contributor) commented:

Closing as I don't think there's anything actionable. Feel free to post further questions though

7 participants