Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allowed custom BatchSamplers when instantiated in *_dataloader hook #13640

Merged
merged 38 commits into from
Jul 27, 2022

Conversation

otaj
Copy link
Contributor

@otaj otaj commented Jul 13, 2022

What does this PR do?

Allowed custom BatchSamplers when instantiated in *_dataloader hook

Fixes #13007, #11807

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

cc @Borda @justusschock @awaelchli @ninginthecloud @rohitgr7 @otaj @akihironitta

@otaj otaj added the data handling Generic data-related topic label Jul 13, 2022
@otaj otaj added this to the pl:1.6.x milestone Jul 13, 2022
@otaj otaj requested a review from Borda as a code owner July 13, 2022 16:46
@otaj otaj self-assigned this Jul 13, 2022
@otaj otaj changed the title Adds support for custom BatchSampler Allowed custom BatchSamplers when instantiated in *_dataloader hook Jul 13, 2022
Copy link
Contributor

@carmocca carmocca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting really complex 😂

Posting a partial review

src/pytorch_lightning/utilities/data.py Outdated Show resolved Hide resolved
src/pytorch_lightning/utilities/data.py Outdated Show resolved Hide resolved
src/pytorch_lightning/utilities/data.py Outdated Show resolved Hide resolved
src/pytorch_lightning/utilities/data.py Show resolved Hide resolved
src/pytorch_lightning/utilities/data.py Outdated Show resolved Hide resolved
src/pytorch_lightning/utilities/data.py Outdated Show resolved Hide resolved
src/pytorch_lightning/utilities/data.py Outdated Show resolved Hide resolved
src/pytorch_lightning/utilities/data.py Show resolved Hide resolved
@otaj
Copy link
Contributor Author

otaj commented Jul 14, 2022

Thank you for the review! I addressed all your concerns 👍

Btw, regarding the complexity, it surprisingly wasn't too bad for me, since it's basically just logic from dataloader handling made generic

Copy link
Contributor

@carmocca carmocca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Minor comments

src/pytorch_lightning/utilities/data.py Outdated Show resolved Hide resolved
src/pytorch_lightning/utilities/data.py Outdated Show resolved Hide resolved
src/pytorch_lightning/utilities/data.py Outdated Show resolved Hide resolved
tests/tests_pytorch/utilities/test_data.py Show resolved Hide resolved
tests/tests_pytorch/utilities/test_data.py Show resolved Hide resolved
tests/tests_pytorch/utilities/test_data.py Outdated Show resolved Hide resolved
@otaj otaj disabled auto-merge July 26, 2022 12:02
@mergify mergify bot added has conflicts and removed ready PRs ready to be merged labels Jul 27, 2022
@mergify mergify bot added ready PRs ready to be merged and removed has conflicts ready PRs ready to be merged labels Jul 27, 2022
@otaj
Copy link
Contributor Author

otaj commented Jul 27, 2022

@carmocca, @awaelchli I finally managed to properly debug the cause of the IPU regression. The problem was, that I was passing all default kwargs from torch.utils.data.DataLoader to poptorch.DataLoader which didn't play nicely with it. I don't think there's a substantial change from when you reviewed it (just split __pl_saved_kwargs into __pl_saved_kwargs and __pl_saved_default_kwargs).

@otaj otaj enabled auto-merge (squash) July 27, 2022 14:55
@otaj otaj merged commit 95f5f17 into master Jul 27, 2022
@otaj otaj deleted the bugfix/custom_batch_sampler branch July 27, 2022 15:32
@psridhar-asapp
Copy link

Hey @otaj I am confused on how to update my sampler to fit this API.

We are dealing with audio data and we want to batch samples by sequence length. In order to do this, we have shape files that are CSVs mapping between each audio sample and its length. Our BatchSampler reads these files, sorts utterances by length, and then constructs fixed batches of similarly long utterances such that all batches are similar in size.

I am not sure what to move to a Sampler here. We could move the loading of this shapes file to the sampler backed by a new dataset, but we do not sample from the shapes file to build batches, we consume the entire file at once.

I think part of my doubt is a lack of understanding on what exactly you do to the Sampler passed into the BatchSampler. I could move the loading of the shapes file inside a Dataset, pass in a Sampler to this dataset as a hacky way of giving the dataset to the BatchSampler, and totally ignore the Sampler in my BatchSampler logic while just reading from its member Dataset, but I imagine this will not solve the problem I'm having.

@otaj
Copy link
Contributor Author

otaj commented Jul 29, 2022

Hi @psridhar-asapp. First things first, this doesn't really add new API. It keeps the functionality as before, just now it allows your custom BatchSampler to have also more arguments, that don't appear in a signature of regular BatchSampler.

I remember we spoke on this topic in some other issue (I think it was #11807). I don't think we have a good resolution to it yet, honestly.

We don't really do much to your sampler. Everything important regarding Sampler and DataLoader is happening in this method (and of course, all the other methods it calls). https://github.com/Lightning-AI/lightning/blob/caaf35689c585fba1bb33b70243e7ddd2cf5d13c/src/pytorch_lightning/trainer/connectors/data_connector.py#L238

Currently, it sets shuffling if it was set to None, and we're training, wraps it in a distributed wrapper and does a bit more of a magic in _update_dataloader, which is just juggling arguments. The most important bit of call tree in _update_dataloader is honestly this function https://github.com/Lightning-AI/lightning/blob/caaf35689c585fba1bb33b70243e7ddd2cf5d13c/src/pytorch_lightning/utilities/data.py#L315

That is where we wrap either BatchSampler or Sampler in structures used for prediction or fault tolerance. I think (and think is the operative word here) it might be possible to remove the Sampler from being passed around and you could be happy with your BatchSampler only solution, but I don't really have a clear vision about that yet and it will require non-trivial engineering work.

If you don't care about either of these things, you can just expose argument sampler in your custom BatchSampler, (which needs to be a proper subclass of it and instantiated in *_dataloader hook) and don't use that sampler at all. I'm sorry, I don't have a better solution for you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working data handling Generic data-related topic pl Generic label for PyTorch Lightning package ready PRs ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Validation skipped when using custom batch sampler
6 participants