
Split callbacks #849

Merged: 28 commits into master from split_callbacks, Feb 23, 2020
Conversation

@hadim (Contributor) commented Feb 15, 2020

Following #776.

All callbacks are split into individual files. Smaller files are easier to read, and the split also makes it easier to track changes in the history.
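For illustration, the package root can re-export the classes so existing imports keep working — a sketch based on the file names that appear in the review below, not the exact contents of this PR:

```python
# pytorch_lightning/callbacks/__init__.py -- sketch of the re-export pattern.
# One class per file; the package root re-exports them so
# `from pytorch_lightning.callbacks import ModelCheckpoint` still works.
from pytorch_lightning.callbacks.callback import Callback
from pytorch_lightning.callbacks.early_stopping import EarlyStopping
from pytorch_lightning.callbacks.model_checkpoint import ModelCheckpoint

__all__ = ['Callback', 'EarlyStopping', 'ModelCheckpoint']
```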

@Borda changed the title from '[DO NOT MERGE] Split callbacks' to '[WIP] Split callbacks' on Feb 15, 2020
@Borda added the 'feature' label (Is an improvement or enhancement) on Feb 15, 2020
@Borda added this to the 0.6.1 milestone on Feb 15, 2020
@pep8speaks commented Feb 16, 2020

Hello @hadim! Thanks for updating this PR.

Line 86:101: E501 line too long (112 > 100 characters)

Comment last updated at 2020-02-22 13:39:46 UTC

@hadim changed the title from '[WIP] Split callbacks' to 'Split callbacks' on Feb 16, 2020
@hadim requested a review from Borda on February 16, 2020 20:55
@hadim (Contributor, Author) commented Feb 16, 2020

Not sure about the failing CI jobs...

@Borda (Member) commented Feb 17, 2020

@jeremyjordan shall we increase PROFILER_OVERHEAD_MAX_TOLERANCE to 0.001?
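For context, that constant caps how much wall-clock overhead the profiler's own bookkeeping may add in the tests. A purely hypothetical sketch of that kind of check (only the constant's name comes from this thread; the helper and the start/stop usage are assumptions):

```python
import time

# Hypothetical sketch: assert that profiling a no-op action costs almost nothing.
PROFILER_OVERHEAD_MAX_TOLERANCE = 0.001  # max allowed average overhead, in seconds

def average_overhead(profiler, n_iter=1000):
    """Average wall-clock cost of one profiled no-op action (assumed start/stop API)."""
    t0 = time.perf_counter()
    for _ in range(n_iter):
        profiler.start('noop')
        profiler.stop('noop')
    return (time.perf_counter() - t0) / n_iter

# assert average_overhead(my_profiler) < PROFILER_OVERHEAD_MAX_TOLERANCE
```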

@jeremyjordan (Contributor) commented

@Borda sure thing, updated on #867

@Borda (Member) left a comment

I like this separation, just a few recommendations :]

Review threads (outdated, resolved) on:
pytorch_lightning/callbacks/__init__.py
pytorch_lightning/callbacks/callback.py
pytorch_lightning/callbacks/early_stopping.py (×3)
pytorch_lightning/callbacks/model_checkpoint.py (×5)
@hadim force-pushed the split_callbacks branch 3 times, most recently from 4dd0f91 to 9499527, on February 17, 2020 23:23
@hadim (Contributor, Author) commented Feb 18, 2020

CIs are great, but I have never been spammed like that for a simple PR (not by reviewers, but by CI services)!

@hadim mentioned this pull request on Feb 18, 2020
@Borda (Member) commented Feb 18, 2020

> CIs are great, but I have never been spammed like that for a simple PR (not by reviewers, but by CI services)!

@hadim the Typo bot is off (wrong choice), or which CI bothers you? =)
For sure we do not want to discourage any contributor, but we shall keep some quality level...

@Borda (Member) left a comment

@PyTorchLightning/core-contributors does anyone have an idea how to avoid duplicating package versions, now that this PR adds a conda environment config?

Review threads on:
.pep8speaks.yml (resolved)
environment.yml (outdated, resolved)
pytorch_lightning/callbacks/early_stopping.py (outdated, resolved)
pytorch_lightning/callbacks/model_checkpoint.py (outdated, resolved)
checkpoint_callback = ModelCheckpoint(filepath='my_path')
Trainer(checkpoint_callback=checkpoint_callback)

# saves checkpoints to my_path whenever 'val_loss' has a new min
Review comment (Member):

yeah, having this comment above the creation of the saving callback would be better :]
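Applied to the quoted snippet, that would read (imports added here for completeness):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# saves checkpoints to my_path whenever 'val_loss' has a new min
checkpoint_callback = ModelCheckpoint(filepath='my_path')
Trainer(checkpoint_callback=checkpoint_callback)
```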

More review threads on pytorch_lightning/callbacks/model_checkpoint.py (outdated, resolved, ×4)
@Borda (Member) commented Feb 18, 2020

@hadim the refactoring is almost done, great work... My kind suggestion: could you please move adding the conda environment (which is also a good addition) to a separate PR, so we do not slow this one down... #199

@hadim (Contributor, Author) commented Feb 19, 2020

Ready to merge!

@Borda (Member) left a comment

LGTM 🚀 Just, as it is a large change, I would like someone else from @PyTorchLightning/core-contributors to approve it...
@hadim GREAT job! Pls add this change to the CHANGELOG...

@Borda requested review from neggert and a team on February 19, 2020 16:12
@Borda added the 'ready' label (PRs ready to be merged) on Feb 19, 2020
@hadim (Contributor, Author) commented Feb 19, 2020

Careful when you commit. flake8 failed... I'll fix it.

@Borda (Member) commented Feb 19, 2020

> Careful when you commit. flake8 failed... I'll fix it.

yeah, I was about to fix it; this is the drawback of editing in the browser (the only way I know of to touch the PR so far...)

@hadim (Contributor, Author) commented Feb 19, 2020

@Borda I also fixed the formatting issues on CHANGELOG.

@Borda mentioned this pull request on Feb 20, 2020
@williamFalcon (Contributor) commented

Awesome PR!

Mind fixing the GPU tests? :)

___________________________________________________________________________________________ test_amp_single_gpu ____________________________________________________________________________________________

tmpdir = local('/tmp/pytest-of-waf251/pytest-5/test_amp_single_gpu0')

    def test_amp_single_gpu(tmpdir):
        """Make sure DDP + AMP work."""
        tutils.reset_seed()

        if not tutils.can_run_gpu_test():
            return

        hparams = tutils.get_hparams()
        model = LightningTestModel(hparams)

        trainer_options = dict(
            default_save_path=tmpdir,
            show_progress_bar=True,
            max_epochs=1,
            gpus=1,
            distributed_backend='ddp',
            precision=16
        )

>       tutils.run_model_test(trainer_options, model)

tests/test_amp.py:32:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/models/utils.py:63: in run_model_test
    result = trainer.fit(model)
pytorch_lightning/trainer/trainer.py:903: in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
../../media/falcon_kcgscratch1/software/miniconda3/envs/pl4/lib/python3.8/site-packages/torch/multiprocessing/spawn.py:162: in spawn
    process.start()
../../media/falcon_kcgscratch1/software/miniconda3/envs/pl4/lib/python3.8/multiprocessing/process.py:121: in start
    self._popen = self._Popen(self)
../../media/falcon_kcgscratch1/software/miniconda3/envs/pl4/lib/python3.8/multiprocessing/context.py:283: in _Popen
    return Popen(process_obj)
../../media/falcon_kcgscratch1/software/miniconda3/envs/pl4/lib/python3.8/multiprocessing/popen_spawn_posix.py:32: in __init__
    super().__init__(process_obj)
../../media/falcon_kcgscratch1/software/miniconda3/envs/pl4/lib/python3.8/multiprocessing/popen_fork.py:19: in __init__
    self._launch(process_obj)
../../media/falcon_kcgscratch1/software/miniconda3/envs/pl4/lib/python3.8/multiprocessing/popen_spawn_posix.py:47: in _launch
    reduction.dump(process_obj, fp)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

obj = <SpawnProcess name='SpawnProcess-1' parent=25414 initial>, file = <_io.BytesIO object at 0x7f57fb2b93b0>, protocol = None

    def dump(obj, file, protocol=None):
        '''Replacement for pickle.dump() using ForkingPickler.'''
>       ForkingPickler(file, protocol).dump(obj)
E       AttributeError: Can't pickle local object 'ModelCheckpoint.__init__.<locals>.<lambda>'

../../media/falcon_kcgscratch1/software/miniconda3/envs/pl4/lib/python3.8/multiprocessing/reduction.py:60: AttributeError
------------------------------------------------------------------------------------------- Captured stderr call -------------------------------------------------------------------------------------------
INFO:root:GPU available: True, used: True
INFO:root:VISIBLE GPUS: 0
INFO:root:Using 16bit precision.
-------------------------------------------------------------------------------------------- Captured log call ---------------------------------------------------------------------------------------------
INFO     root:distrib_data_parallel.py:216 GPU available: True, used: True
INFO     root:distrib_data_parallel.py:264 VISIBLE GPUS: 0
INFO     root:auto_mix_precision.py:21 Using 16bit precision.
_____________________________________________________________________________________________ test_amp_gpu_ddp _____________________________________________________________________________________________

tmpdir = local('/tmp/pytest-of-waf251/pytest-5/test_amp_gpu_ddp0')

    def test_amp_gpu_ddp(tmpdir):
        """Make sure DDP + AMP work."""
        if not tutils.can_run_gpu_test():
            return

        tutils.reset_seed()
        tutils.set_random_master_port()

        hparams = tutils.get_hparams()
        model = LightningTestModel(hparams)

        trainer_options = dict(
            default_save_path=tmpdir,
            show_progress_bar=True,
            max_epochs=1,
            gpus=2,
            distributed_backend='ddp',
            precision=16
        )

>       tutils.run_model_test(trainer_options, model)

tests/test_amp.py:81:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/models/utils.py:63: in run_model_test
    result = trainer.fit(model)
pytorch_lightning/trainer/trainer.py:903: in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
../../media/falcon_kcgscratch1/software/miniconda3/envs/pl4/lib/python3.8/site-packages/torch/multiprocessing/spawn.py:162: in spawn
    process.start()
../../media/falcon_kcgscratch1/software/miniconda3/envs/pl4/lib/python3.8/multiprocessing/process.py:121: in start
    self._popen = self._Popen(self)
../../media/falcon_kcgscratch1/software/miniconda3/envs/pl4/lib/python3.8/multiprocessing/context.py:283: in _Popen
    return Popen(process_obj)
../../media/falcon_kcgscratch1/software/miniconda3/envs/pl4/lib/python3.8/multiprocessing/popen_spawn_posix.py:32: in __init__
    super().__init__(process_obj)
../../media/falcon_kcgscratch1/software/miniconda3/envs/pl4/lib/python3.8/multiprocessing/popen_fork.py:19: in __init__
    self._launch(process_obj)
../../media/falcon_kcgscratch1/software/miniconda3/envs/pl4/lib/python3.8/multiprocessing/popen_spawn_posix.py:47: in _launch
    reduction.dump(process_obj, fp)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

obj = <SpawnProcess name='SpawnProcess-3' parent=25414 initial>, file = <_io.BytesIO object at 0x7f57fb5ce180>, protocol = None

    def dump(obj, file, protocol=None):
        '''Replacement for pickle.dump() using ForkingPickler.'''
>       ForkingPickler(file, protocol).dump(obj)
E       AttributeError: Can't pickle local object 'ModelCheckpoint.__init__.<locals>.<lambda>'

../../media/falcon_kcgscratch1/software/miniconda3/envs/pl4/lib/python3.8/multiprocessing/reduction.py:60: AttributeError

@williamFalcon added the 'Failed GPU tests' label and removed the 'ready' label on Feb 22, 2020
@hadim (Contributor, Author) commented Feb 22, 2020

I can't easily run the test on multiple GPUs. Looking at the traceback, the error seems related to multiprocessing and pickling of ModelCheckpoint. I don't see where the error could be...

@Borda any idea?

@ethanwharris (Member) commented Feb 22, 2020

> I can't easily run the test on multiple GPUs. Looking at the traceback, the error seems related to multiprocessing and pickling of ModelCheckpoint. I don't see where the error could be...

Lambda functions can't be pickled. Distributed stuff needs to pickle the objects to move them on to the right machines. The solution here would be to remove the lambda function in the init of the ModelCheckpoint class, i.e. this line:

self.save_function = lambda x: None

should be removed. Hope that helps!
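A minimal, self-contained demonstration of the failure mode and the fix — plain classes standing in for ModelCheckpoint here; no GPUs needed, since the error happens at pickling time:

```python
import pickle

class BadCheckpoint:
    """Mimics the problematic pattern: a lambda bound in __init__."""
    def __init__(self):
        self.save_function = lambda x: None  # local lambda: unpicklable

class GoodCheckpoint:
    """The fix: don't bind a lambda; leave the attribute unset so a real
    (picklable) function can be assigned later."""
    def __init__(self):
        self.save_function = None

pickle.dumps(GoodCheckpoint())  # works
try:
    pickle.dumps(BadCheckpoint())
except (pickle.PicklingError, AttributeError) as err:
    # AttributeError: Can't pickle local object 'BadCheckpoint.__init__.<locals>.<lambda>'
    print(err)
```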

@hadim (Contributor, Author) commented Feb 22, 2020

Done. Thanks @ethanwharris.

@Borda (Member) commented Feb 22, 2020

@hadim what was the fix? It seems the same is happening in #833.

@Borda mentioned this pull request on Feb 22, 2020
@hadim (Contributor, Author) commented Feb 22, 2020

Removing the lambda function in model_checkpoint. See 2244f8c.

Note that I haven't tested it since I don't have easy access to a ddp machine.
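For what it's worth, the spawn failure can be reproduced (and the fix verified) without a DDP machine, since it occurs while pickling the callback, before any GPU is touched. A sketch, assuming the post-fix ModelCheckpoint and borrowing the constructor argument from the docstring example above:

```python
import pickle
from pytorch_lightning.callbacks import ModelCheckpoint

# ddp spawn pickles the trainer state, callbacks included, before any GPU work,
# so a plain pickle round-trip exercises the failing code path on any machine.
ckpt = ModelCheckpoint(filepath='my_path')
pickle.dumps(ckpt)  # raised "Can't pickle local object ... <lambda>" before 2244f8c
```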

@williamFalcon merged commit 89d5772 into Lightning-AI:master on Feb 23, 2020
@hadim deleted the split_callbacks branch on February 23, 2020 02:46
tullie pushed a commit to tullie/pytorch-lightning that referenced this pull request on Apr 3, 2020:
* add .vscode in .gitignore

* Split callbacks into individual files + add a property to Callback for easy trainer instance access

* formatting

* Add a conda env file for quick and easy env setup to develop on PL

* Address comments

* add fix to kth_best_model

* add some typing to callbacks

* fix typo

* add autopep8 config to pyproject.toml

* format again

* format

* fix toml

* fix toml again

* consistent max line length in all config files

* remove conda env file

* Update pytorch_lightning/callbacks/early_stopping.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* docstring

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* fix logic error

* format

* simplify if/else

* format

* fix linting issue in changelog

* edit changelog about new callback mechanism

* fix remaining formatting issue on CHANGELOG

* remove lambda function because it's not compatible with pickle (used during ddp)

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>