
Hardware specific parts of Accelerator Refactoring #5719

Merged: 41 commits merged into release/1.2-dev from ref/hardware-specific on Feb 1, 2021

Conversation

@justusschock (Member) commented on Jan 30, 2021

What does this PR do?

Adds the hardware-specific parts of the refactoring (#5616) as Accelerators for CPU, GPU and TPU, the accelerator connector, and the specific plugins for single-device training, single-TPU training and multi-TPU training (TPUSpawn).

Only files with new classes are added, and some imports are changed. There is no integration into the Trainer yet.

Should be merged after #5715, #5718 and #5714.
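For orientation, here is a rough sketch of how these pieces could fit together. The class names, constructor arguments and methods below are illustrative assumptions, not the exact API added by this PR:

    # Illustrative sketch only; names and signatures are assumptions.
    class Accelerator:
        """Base class that pairs a training-type plugin with a precision plugin."""

        def __init__(self, training_type_plugin, precision_plugin):
            self.training_type_plugin = training_type_plugin
            self.precision_plugin = precision_plugin

        def setup(self, trainer, model):
            # hardware-specific preparation (device placement, etc.) happens here
            raise NotImplementedError


    class GPUAccelerator(Accelerator):
        def setup(self, trainer, model):
            model.cuda()  # move the model to the GPU before training starts


    class SingleDevicePlugin:
        """Training-type plugin for plain single-process, single-device training."""

        def __init__(self, device):
            self.device = device

The accelerator connector mentioned above would then be responsible for picking an accelerator/plugin combination from the Trainer arguments; the sketch only illustrates the division of responsibilities, not the real signatures.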

@pep8speaks commented on Jan 30, 2021

Hello @justusschock! Thanks for updating this PR.

Line 383:13: W503 line break before binary operator

Comment last updated at 2021-02-01 12:37:53 UTC
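For reference, W503 is raised when a continuation line starts with a binary operator. A minimal illustration (not the actual line 383 from this diff):

    # Flagged by W503: the line break comes before the binary operator
    total = (first_value
             + second_value)

    # One way to satisfy the check: break after the operator instead
    total = (first_value +
             second_value)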

justusschock and others added 10 commits January 31, 2021 17:06
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
codecov bot commented on Jan 31, 2021

Codecov Report

Merging #5719 (ea88661) into release/1.2-dev (3bacac7) will decrease coverage by 4%.
The diff coverage is 30%.

@@               Coverage Diff                @@
##           release/1.2-dev   #5719    +/-   ##
================================================
- Coverage               89%     86%    -4%     
================================================
  Files                  173     181     +8     
  Lines                12495   13110   +615     
================================================
+ Hits                 11175   11226    +51     
- Misses                1320    1884   +564     

@Borda added the ready (PRs ready to be merged) label on Jan 31, 2021
@Borda (Member) commented on Feb 1, 2021

Seems to be the last remaining issue:
E ImportError: cannot import name 'ParallelLoader' from 'torch_xla' (/root/miniconda3/envs/lightning/lib/python3.7/site-packages/torch_xla/__init__.py)
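Assuming a standard torch_xla install, the import is probably just pointing at the wrong module: ParallelLoader normally lives in torch_xla.distributed.parallel_loader rather than the torch_xla top level. A hedged sketch of the fix:

    # Fails with the ImportError above: ParallelLoader is not exported at the top level
    # from torch_xla import ParallelLoader

    # ParallelLoader is provided by the distributed submodule (assuming a standard torch_xla layout)
    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.parallel_loader as xla_pl

    device = xm.xla_device()
    loader = xla_pl.ParallelLoader(train_dataloader, [device])  # train_dataloader assumed defined elsewhere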

@tchaton (Contributor) left a comment:
@justusschock Really good work there! Not all comments need to be addressed in this PR.

Review threads on pytorch_lightning/accelerators/accelerator_connector.py (outdated, resolved)

    if self.is_global_zero:
        # load weights saved in ddp
        path = os.path.join(original_model.trainer.default_root_dir, "__temp_weight_distributed_end.ckpt")
Contributor:
I feel like __temp_weight_distributed_end.ckpt should be a GLOBAL property.

Member Author:

what do you mean by global property?

Contributor:
A class attribute:

    class ...:

        TEMPORARY_WEIGHT_PATH = "__temp_weight_distributed_end.ckpt"

        def __init__(self, ...):
            ...

        def ...(self, ...):
            path = os.path.join(original_model.trainer.default_root_dir, self.TEMPORARY_WEIGHT_PATH)


# load weights if not interrupted
# TODO: check for trainer reference
if self.on_colab_kaggle and not model.trainer.testing:
Contributor:
Could we have a ColabEnv for TPU and move some of this logic there? In another PR :)

Member Author:

Not sure I completely understand this... You mean a ClusterEnvironment for that? We can certainly look into that.

Contributor:

Not sure Colab is a ClusterEnvironment.

I meant something more like a TPURunningEnvironment -> Colab, Kaggle_colab, etc...

Best,
T.C
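A minimal sketch of what such an abstraction might look like. The class names and environment variables below are assumptions for illustration, not something defined in this PR:

    import os


    class TPURunningEnvironment:
        """Hypothetical base class describing where a TPU run is hosted."""

        def is_active(self) -> bool:
            raise NotImplementedError


    class ColabEnvironment(TPURunningEnvironment):
        def is_active(self) -> bool:
            # Colab runtimes typically expose COLAB_GPU in the environment
            return "COLAB_GPU" in os.environ


    class KaggleEnvironment(TPURunningEnvironment):
        def is_active(self) -> bool:
            return "KAGGLE_URL_BASE" in os.environ

The on_colab_kaggle check quoted above could then become a query against whichever environment is detected, rather than an inline environment-variable test.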

Review thread on pytorch_lightning/plugins/training_type/tpu_spawn.py (outdated, resolved)
@tchaton self-requested a review on February 1, 2021 09:58
@awaelchli enabled auto-merge (squash) on February 1, 2021 12:34
@awaelchli merged commit b3ebc18 into release/1.2-dev on Feb 1, 2021
@awaelchli deleted the ref/hardware-specific branch on February 1, 2021 13:35
Labels: ready (PRs ready to be merged), refactor
7 participants