support for native amp #1561

williamFalcon · 2020-04-22T16:22:07Z

@mcarilli mind taking a look?

Issue 1

We have a slight issue with the DP API...

MyModel(nn.Module):
    ...
    @autocast()
    def forward(self, input):
       ...

# Alternatively
MyModel(nn.Module):
    ...
    def forward(self, input):
        with autocast():
            ...

@ethanwharris suggested a way around this which we have in the PR

original_fwd = model.forward
model.forward = autocast()(model.forward)

# train and stuff
# ...

model.forward = original_fwd

Issue 2

How do we save the state of the scaling factor to resume training?
@mcarilli

pytorch_lightning/core/lightning.py

pytorch_lightning/trainer/distrib_parts.py

Borda · 2020-04-22T16:43:23Z

since which version is amp in pytorch native?

williamFalcon · 2020-04-22T16:56:30Z

1.6. but we don’t need to explicitly check. we can test the properties as i did

pytorch_lightning/trainer/distrib_parts.py

mcarilli · 2020-04-22T18:34:04Z

How do we save the state of the scaling factor to resume training?

saved_state = scaler.state_dict()
scaler.load_state_dict(saved_state)

mcarilli · 2020-04-22T18:44:45Z

pytorch_lightning/core/hooks.py

@@ -138,11 +138,20 @@ def backward(self, use_amp, loss, optimizer):
                else:
                    loss.backward()

+        .. note:: with PyTorch 1.6+ + precision=16 + multiple optimizers, set .backward(retrain_graph=True)


You don't need this note.

The example is misleading, I guess. The retain_graph=True bit has nothing to do with Amp, it's only present because both losses interleave outputs from multiple models. Both backward passes use the same model graphs, so the first backward() must not tear them down. retain_graph=True would be necessary with or without Amp. That's unclear and maybe I should either change the example snippet so retain_graph=True is not needed, or add a comment clarifying that retain_graph=True there is not Amp-related.

mcarilli · 2020-04-22T18:45:58Z

pytorch_lightning/core/hooks.py

+                return
+
+            if self.trainer.use_native_amp:
+                # don't forget to retain graph on backward with multiple optimizers


also remove this comment, see https://github.com/PyTorchLightning/pytorch-lightning/pull/1561/files#r413230111.

pytorch_lightning/trainer/trainer.py

pytorch_lightning/core/hooks.py

mergify · 2020-04-22T21:40:47Z

This pull request is now in conflict... :(

williamFalcon · 2020-04-23T00:05:09Z

@Borda these tests are failing bc amp is not installed... did we remove amp?

williamFalcon · 2020-04-23T00:10:20Z

pytorch_lightning/trainer/training_io.py

@@ -281,6 +281,10 @@ def restore(self, checkpoint_path: str, on_gpu: bool):
        if on_gpu:
            model.cuda(self.root_gpu)

+        # restore amp scaling
+        if self.use_amp and self.use_native_amp and 'native_amp_scaling_state' in checkpoint:


@mcarilli sanity check this loading?

Looks good if you fix the saving https://github.com/PyTorchLightning/pytorch-lightning/pull/1561/files#r413418705

Like saving, loading should occur either at the very beginning of an iteration (before any training-related scaler calls for that iteration) or at the end of an iteration, after scaler.update(). It doesn't make a lot of sense to load state dicts at the end of an iteration, but if the saved state originated from a scaler.state_dict() call at the end of, say, iteration 1000 (i.e. after iteration 1000's call to scaler.update()), then it's ok to call load_state_dict at the beginning of iteration 1001 to resume.

williamFalcon · 2020-04-23T00:10:31Z

pytorch_lightning/trainer/training_io.py

@@ -316,6 +320,10 @@ def dump_checkpoint(self):

        checkpoint['state_dict'] = model.state_dict()

+        # restore native amp scaling
+        if self.use_amp and self.use_native_amp and 'native_amp_scaling_state' in checkpoint:


@mcarilli sanity check this saving?

state_dict is a method, as for modules and optimizers, so checkpoint['native_amp_scaling_state'] = self.scaler.state_dict() is what you want.
checkpoint['native_amp_scaling_state'] = self.scaler.state_dict would stash the bound-method object itself :P

Also you should make sure state_dict() is retrieved either at the very beginning of an iteration (before any scaler method calls) or at the very end (after scaler.update()), and that the model and optimizer state dicts are saved at that same spot.

I can't tell from these lines alone if the calling code occurs at a spot that obeys those criteria.

i thought it was a property haha, but i guess it's consistent with the other state_dict() calls haha

lol i see. it's consistent with the rest

Another thing to consider is that with torch.cuda.amp, it's permissible to

load a checkpoint from a model + optimizer not trained with Amp, and resume training with Amp enabled, or

load a checkpoint from a model + optimizer trained with Amp, and resume training without Amp.

I think your if criteria are flexible enough that both those cases can happen naturally with the appropriate user args but I'm not sure just from looking at it.

yeah this code works.

Case 1: Train with amp, load amp

works fine

case 2: Train amp, load and not use amp

in this case, lightning loads the amp state but amp is disabled so user doesn't use it at all

case 3: train regular, resume regular

works fine

case 4: train regular, resume with amp

in this case the checkpoint has no amp state and model starts normal but on amp.

mcarilli · 2020-04-23T00:40:40Z

pytorch_lightning/trainer/training_io.py

@@ -316,6 +320,10 @@ def dump_checkpoint(self):

        checkpoint['state_dict'] = model.state_dict()

+        # restore native amp scaling
+        if self.use_amp and self.use_native_amp and 'native_amp_scaling_state' in checkpoint:
+            checkpoint['native_amp_scaling_state'] = self.scaler.state_dict


checkpoint['native_amp_scaling_state'] = self.scaler.state_dict()

Borda · 2020-04-23T07:47:48Z

@Borda these tests are failing bc amp is not installed... did we remove amp?

probably, unfortunately, it happened here with Horovoed #1529 (comment)
APEX was removed in 9257b37

mergify · 2020-04-23T16:54:09Z

This pull request is now in conflict... :(

.drone.yml

codecov · 2020-04-23T18:32:53Z

Codecov Report

Merging #1561 into master will decrease coverage by 0%.
The diff coverage is 55%.

@@          Coverage Diff           @@
##           master   #1561   +/-   ##
======================================
- Coverage      89%     88%   -0%     
======================================
  Files          68      68           
  Lines        3913    3955   +42     
======================================
+ Hits         3473    3496   +23     
- Misses        440     459   +19

Saving was introduced in Lightning-AI#1561.

Saving was introduced in #1561.

mergify bot requested a review from a team April 22, 2020 16:22

williamFalcon commented Apr 22, 2020

View reviewed changes

pytorch_lightning/core/lightning.py Show resolved Hide resolved

williamFalcon requested review from ethanwharris and justusschock April 22, 2020 16:37

williamFalcon commented Apr 22, 2020

View reviewed changes

pytorch_lightning/trainer/distrib_parts.py Show resolved Hide resolved

williamFalcon commented Apr 22, 2020

View reviewed changes

pytorch_lightning/trainer/distrib_parts.py Show resolved Hide resolved

Borda added feature Is an improvement or enhancement priority: 0 High priority task labels Apr 22, 2020

Borda added this to the 0.7.4 milestone Apr 22, 2020

ethanwharris approved these changes Apr 22, 2020

View reviewed changes

pytorch_lightning/trainer/distrib_parts.py Show resolved Hide resolved

mergify bot requested a review from a team April 22, 2020 17:40

mcarilli reviewed Apr 22, 2020

View reviewed changes

pytorch_lightning/trainer/trainer.py Show resolved Hide resolved

mcarilli reviewed Apr 22, 2020

View reviewed changes

pytorch_lightning/core/hooks.py Show resolved Hide resolved

williamFalcon force-pushed the apex branch from 0a20aee to 0ead459 Compare April 22, 2020 23:44

williamFalcon commented Apr 23, 2020

View reviewed changes

mcarilli reviewed Apr 23, 2020

View reviewed changes

williamFalcon added 4 commits April 23, 2020 12:58

adding native amp suppport

fd42641

adding native amp suppport

ea1650a

adding native amp suppport

807033d

adding native amp suppport

d698328

williamFalcon added 10 commits April 23, 2020 12:58

autocast

794df48

autocast

fb6e414

autocast

ba02a20

autocast

ee6299e

autocast

4d06040

autocast

199c96c

removed comments

2af9dc9

removed comments

afb6801

added state saving

fa87d1d

added state saving

c60a885

williamFalcon force-pushed the apex branch from dee42c8 to c60a885 Compare April 23, 2020 16:58

Borda and others added 2 commits April 23, 2020 19:37

try install amp again

de22e4f

added state saving

879e691

Borda reviewed Apr 23, 2020

View reviewed changes

.drone.yml Outdated Show resolved Hide resolved

mergify bot requested a review from a team April 23, 2020 18:14

drop Apex reinstall

60b9963

williamFalcon merged commit 29ebe92 into master Apr 23, 2020

Borda deleted the apex branch April 23, 2020 20:41

kepler added a commit to kepler/pytorch-lightning that referenced this pull request May 11, 2020

Fix saving native AMP scaler state

b6daf94

Saving was introduced in Lightning-AI#1561.

kepler mentioned this pull request May 11, 2020

Fix saving native AMP scaler state #1777

Merged

5 tasks

williamFalcon pushed a commit that referenced this pull request May 12, 2020

Fix saving native AMP scaler state (#1777)

d120f97

Saving was introduced in #1561.

mcarilli mentioned this pull request Jun 25, 2020

Added Horovod distributed backend #1529

Merged

ptrblck mentioned this pull request Jun 29, 2020

Convergence issues when using pytorch's native AMP pytorch/pytorch#38788

Open

Borda modified the milestones: 0.7.4, v0.7.x Apr 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

support for native amp #1561

support for native amp #1561

williamFalcon commented Apr 22, 2020 •

edited

Loading

Borda commented Apr 22, 2020

williamFalcon commented Apr 22, 2020

mcarilli commented Apr 22, 2020 •

edited

Loading

mcarilli Apr 22, 2020 •

edited

Loading

mcarilli Apr 22, 2020

mergify bot commented Apr 22, 2020

williamFalcon commented Apr 23, 2020

williamFalcon Apr 23, 2020

mcarilli Apr 23, 2020 •

edited

Loading

williamFalcon Apr 23, 2020

mcarilli Apr 23, 2020 •

edited

Loading

mcarilli Apr 23, 2020 •

edited

Loading

williamFalcon Apr 23, 2020 •

edited

Loading

mcarilli Apr 23, 2020

mcarilli Apr 23, 2020 •

edited

Loading

williamFalcon Apr 23, 2020

mcarilli Apr 23, 2020

Borda commented Apr 23, 2020 •

edited

Loading

mergify bot commented Apr 23, 2020

codecov bot commented Apr 23, 2020

support for native amp #1561

support for native amp #1561

Conversation

williamFalcon commented Apr 22, 2020 • edited Loading

Issue 1

Issue 2

Borda commented Apr 22, 2020

williamFalcon commented Apr 22, 2020

mcarilli commented Apr 22, 2020 • edited Loading

mcarilli Apr 22, 2020 • edited Loading

Choose a reason for hiding this comment

mcarilli Apr 22, 2020

Choose a reason for hiding this comment

mergify bot commented Apr 22, 2020

williamFalcon commented Apr 23, 2020

williamFalcon Apr 23, 2020

Choose a reason for hiding this comment

mcarilli Apr 23, 2020 • edited Loading

Choose a reason for hiding this comment

williamFalcon Apr 23, 2020

Choose a reason for hiding this comment

mcarilli Apr 23, 2020 • edited Loading

Choose a reason for hiding this comment

mcarilli Apr 23, 2020 • edited Loading

Choose a reason for hiding this comment

williamFalcon Apr 23, 2020 • edited Loading

Choose a reason for hiding this comment

mcarilli Apr 23, 2020

Choose a reason for hiding this comment

mcarilli Apr 23, 2020 • edited Loading

Choose a reason for hiding this comment

williamFalcon Apr 23, 2020

Choose a reason for hiding this comment

Case 1: Train with amp, load amp

case 2: Train amp, load and not use amp

case 3: train regular, resume regular

case 4: train regular, resume with amp

mcarilli Apr 23, 2020

Choose a reason for hiding this comment

Borda commented Apr 23, 2020 • edited Loading

mergify bot commented Apr 23, 2020

codecov bot commented Apr 23, 2020

Codecov Report

williamFalcon commented Apr 22, 2020 •

edited

Loading

mcarilli commented Apr 22, 2020 •

edited

Loading

mcarilli Apr 22, 2020 •

edited

Loading

mcarilli Apr 23, 2020 •

edited

Loading

mcarilli Apr 23, 2020 •

edited

Loading

mcarilli Apr 23, 2020 •

edited

Loading

williamFalcon Apr 23, 2020 •

edited

Loading

mcarilli Apr 23, 2020 •

edited

Loading

Borda commented Apr 23, 2020 •

edited

Loading