This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Lightning Trainer/PyTorch Lightning 1.7.0 support + CI Fixes (JIT Tracing and Functions to Classes conversion) #1410

Merged
merged 33 commits into from
Aug 12, 2022

Conversation

krshrimali
Contributor

@krshrimali krshrimali commented Aug 4, 2022

What does this PR do?

This PR adds support for PL 1.7.0 to Flash, and fixes a few issues with the CI. Summary:

  • The Flash CI has been failing when a function is converted to a ClassFromFunction class, because we were not checking whether its type is a subclass of CustomClassFunctionBase.
  • For JIT tracing, the model needs to be attached to the Trainer class, otherwise it raises an error similar to: ImageClassifier is not attached to Trainer.
  • The BaaL integration fails on recent Lightning Trainer versions: on_predict_dataloader, on_train_dataloader, and on_test_dataloader were deprecated and then removed in PL 1.7. I'm not sure we still need them; removing them works.
  • In the BaaL integration, probabilities must be checked for emptiness before calling torch.cat, which fails on an empty list.
  • In PL 1.7.0, the DataLoader class only keeps attribute changes made through the __init__ method. Flash changes the collate function after the DataLoader is initialized, which caused multiple errors. This PR fixes that.
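For the empty-probabilities point above, the guard can be sketched roughly like this (the function name is illustrative, not Flash's actual code):

```python
import torch

def concat_probabilities(probabilities):
    # torch.cat raises "expected a non-empty list of Tensors" when given an
    # empty sequence, so return an empty tensor instead of calling it.
    # Illustrative sketch of the check described above, not Flash's code.
    if len(probabilities) == 0:
        return torch.empty(0)
    return torch.cat(probabilities)
```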

This PR fixes these issues. Note that we should document that the best way to JIT trace a model in Flash is to:

# 2. Build the task
model = ImageClassifier(backbone="resnet18", labels=datamodule.labels)

# 3. Create the trainer and finetune the model
trainer = flash.Trainer(max_epochs=3, gpus=torch.cuda.device_count())

# Attach trainer to the model object
model.trainer = trainer
trace = torch.jit.trace(model, torch.randn((3, 196, 196)))

Issue: #1413

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests? [not needed for typos/docs]
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@krshrimali krshrimali changed the title Fix Flash CI: in case of converting functions to classes Fix Flash CLI: in case of converting functions to classes Aug 4, 2022
@codecov

codecov bot commented Aug 4, 2022

Codecov Report

Merging #1410 (8e03fd7) into master (8737ee8) will decrease coverage by 0.00%.
The diff coverage is 85.00%.

@@            Coverage Diff             @@
##           master    #1410      +/-   ##
==========================================
- Coverage   91.77%   91.76%   -0.01%     
==========================================
  Files         286      286              
  Lines       12857    12861       +4     
==========================================
+ Hits        11799    11802       +3     
- Misses       1058     1059       +1     
Flag Coverage Δ
unittests 91.76% <85.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
flash/core/utilities/lightning_cli.py 98.16% <ø> (ø)
flash/core/integrations/icevision/adapter.py 94.78% <80.00%> (-1.52%) ⬇️
...ash/image/classification/integrations/baal/data.py 87.96% <100.00%> (ø)
...ash/image/classification/integrations/baal/loop.py 94.81% <100.00%> (-0.12%) ⬇️
flash/image/embedding/model.py 98.48% <100.00%> (ø)
flash/image/embedding/vissl/hooks.py 98.24% <100.00%> (ø)
flash/text/question_answering/model.py 94.55% <0.00%> (+0.68%) ⬆️


@krshrimali krshrimali marked this pull request as draft August 4, 2022 03:03
@krshrimali krshrimali changed the title Fix Flash CLI: in case of converting functions to classes Fix Flash CLI: in case of converting functions to classes + JIT tracing tests Aug 4, 2022
@krshrimali krshrimali mentioned this pull request Aug 5, 2022
model.eval()

model.trainer = trainer

Just curious, anything wrong with the current workaround?

@krshrimali krshrimali marked this pull request as ready for review August 12, 2022 10:41
@krshrimali
Contributor Author

About the 5 failures left in the CI:

  1. CI testing / pytest (ubuntu-20.04, 3.9, latest, stable, graph) (pull_request): This failure is flaky, and has been fixed in PL. Once there is another release of PL, we should be good.
  2. CI testing / pytest (ubuntu-20.04, 3.9, latest, stable, audio) (pull_request): PR Remove sd2 sound extensions from supported audio extensions #1409 is opened to fix it.
  3. Lightning-AI.lightning-flash (Special): Horovod versioning error, but I also see a /usr/bin/pip not found error. I plan to fix this in a separate PR; it doesn't seem urgent, since the first step was to support PL 1.7 in Flash.

Thanks to @otaj for his help with a few PL 1.7-related failures; it was very helpful. ❤️ Also thanks to @rohitgr7 for a fix! 🔥

cc: @ethanwharris @Borda @carmocca

Collaborator

@ethanwharris ethanwharris left a comment


Great work @krshrimali ! Some very painful bugs here. Couple of suggestions. Also, let's add some things to the CHANGELOG here 😃

Comment on lines 113 to 132
def change_collate_fn_dataloader(self, data_loader):
    # Starting with PL 1.7.0, changing attributes after the DataLoader is
    # initialized will not work, so we manually set the collate_fn to the
    # wrapped version, for now.
    new_kwargs = getattr(data_loader, "__pl_saved_kwargs", None)
    new_collate_fn = functools.partial(self._wrap_collate_fn, data_loader.collate_fn)
    if new_kwargs:
        new_kwargs["collate_fn"] = new_collate_fn
        setattr(data_loader, "__pl_saved_kwargs", new_kwargs)
    data_loader.collate_fn = new_collate_fn
    return data_loader

def update_collate_fn_dataloader(self, new_collate_fn, data_loader):
    # Starting with PL 1.7.0, changing attributes after the DataLoader is
    # initialized will not work, so we manually update the collate_fn for
    # the dataloader, for now.
    new_kwargs = getattr(data_loader, "__pl_saved_kwargs", None)
    if new_kwargs:
        new_kwargs["collate_fn"] = new_collate_fn
        setattr(data_loader, "__pl_saved_kwargs", new_kwargs)
    data_loader.collate_fn = new_collate_fn
    return data_loader
Collaborator


Seems like a lot of duplication; change_collate_fn_dataloader is equivalent to:

self.update_collate_fn_dataloader(functools.partial(self._wrap_collate_fn, data_loader.collate_fn), data_loader)

Maybe it would be better to just do that and remove change_collate_fn_dataloader.
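That deduplication could look roughly like the following standalone sketch. FakeDataLoader and wrap_collate_fn are stand-ins for torch's DataLoader and Flash's _wrap_collate_fn, used here only to illustrate the __pl_saved_kwargs handling:

```python
import functools

class FakeDataLoader:
    """Minimal stand-in for torch.utils.data.DataLoader (illustration only)."""

    def __init__(self, collate_fn):
        self.collate_fn = collate_fn
        # PL >= 1.7 re-creates dataloaders from the kwargs it saved at
        # __init__ time, so mutating attributes afterwards is not enough.
        setattr(self, "__pl_saved_kwargs", {"collate_fn": collate_fn})

def update_collate_fn_dataloader(new_collate_fn, data_loader):
    # Update both the live attribute and the saved kwargs so the change
    # survives PL's re-instantiation of the dataloader.
    new_kwargs = getattr(data_loader, "__pl_saved_kwargs", None)
    if new_kwargs:
        new_kwargs["collate_fn"] = new_collate_fn
        setattr(data_loader, "__pl_saved_kwargs", new_kwargs)
    data_loader.collate_fn = new_collate_fn
    return data_loader

def wrap_collate_fn(original_collate_fn, batch):
    # Hypothetical wrapper standing in for Flash's _wrap_collate_fn.
    return original_collate_fn(batch)

def change_collate_fn_dataloader(data_loader):
    # change_collate_fn_dataloader reduces to update_collate_fn_dataloader
    # applied to the wrapped collate function, as the review suggests.
    wrapped = functools.partial(wrap_collate_fn, data_loader.collate_fn)
    return update_collate_fn_dataloader(wrapped, data_loader)
```

Keeping the live collate_fn attribute and the saved kwargs in sync matters because PL >= 1.7 rebuilds dataloaders from those saved kwargs.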


@ethanwharris, you are right, this could be merged into a single function. By the way, while this is technically not a bug on the PL side, it is definitely a regression in user experience around dataloaders, and I'm sorry @krshrimali had to run into it. I suspect he won't be the last one, though, so I will send a PR to PL today or early next week to improve the UX there.

Contributor Author


@ethanwharris - Agreed. Fixed!

@otaj - Thank you for your help with this, please don't be sorry - looking forward to your PR which improves the UX further. Thank you again!

@@ -33,6 +33,7 @@ jobs:
- { os: 'ubuntu-20.04', python-version: 3.9, requires: 'latest', release: 'pre', topic: [ 'core' ] }
- { os: 'ubuntu-20.04', python-version: 3.9, requires: 'latest', release: 'stable', topic: [ 'image' ] }
- { os: 'ubuntu-20.04', python-version: 3.9, requires: 'latest', release: 'stable', topic: [ 'image','image_extras' ] }
- { os: 'ubuntu-20.04', python-version: 3.9, requires: 'latest', release: 'stable', topic: [ 'image','image_extras_effdet' ] }
Collaborator


This seems to have no effect, as it will lead to just the image deps being installed, and effdet is installed only for serving. I think it would be cleaner not to add a new job but just set the effdet version in requirements/datatype_image_extras to the one from GitHub, since we don't expect users to install the image_extras extras anyway.

Contributor Author


I'm sorry, I should have mentioned: I tried putting the effdet version in datatype_image_extras.txt, and pip install -r requirements/datatype_image_extras.txt fails miserably :/ Lots of version conflicts. Hence I thought to create a new file and put effdet there.

Contributor Author


From the offline conversation with @ethanwharris - we decided to limit PyTorch version to <1.12 until icevision supports effdet 0.3.0.

@krshrimali krshrimali changed the title Fix Flash CLI: in case of converting functions to classes + JIT tracing tests Lightning Trainer/PyTorch Lightning 1.7.0 support + CI Fixes (JIT Tracing and Functions to Classes conversion) Aug 12, 2022
@krshrimali
Contributor Author

Hi, @ethanwharris - the 5 failures in the CI are expected; only 1 of them is something I haven't investigated yet. Fixes for the others are either already merged in PL, or we have a PR open in Flash. Could you please take another look, and help merge if it all looks good? Thanks!

Collaborator

@ethanwharris ethanwharris left a comment


LGTM 🚀

@@ -91,7 +91,7 @@ def on_run_start(self, *args: Any, **kwargs: Any) -> None:
assert isinstance(self.trainer.datamodule, ActiveLearningDataModule)
if self._datamodule_state_dict is not None:
self.trainer.datamodule.load_state_dict(self._datamodule_state_dict)
self.trainer.predict_loop._return_predictions = True
self.trainer.predict_loop.return_predictions = True
Collaborator


Not for this PR, but we should figure out what to do with the BaaL loop. It is so bound to a particular PL and BaaL version that I'm not sure it makes sense to have it as part of the framework. Maybe it would be better as a tutorial or in bolts? cc @otaj

@ethanwharris ethanwharris merged commit 2657315 into master Aug 12, 2022
@ethanwharris ethanwharris deleted the flash-ci/functions-to-classes branch August 12, 2022 13:59