
[T104292598] Refactor the "LRA" training code -> Pytorch Lightning #343

Merged: 10 commits merged into facebookresearch:main on Jul 8, 2022
Conversation

@lisjin (Contributor) commented Jun 28, 2022

What does this PR do?

Refactors LRA's run_tasks.py to use PyTorch Lightning as the trainer.
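For context, this follows the standard PyTorch Lightning pattern of moving the model, loss, and optimizer setup into a LightningModule and letting the Trainer own the loop. A minimal sketch of that pattern; the class name, placeholder model, and config keys below are illustrative assumptions, not the exact code in this PR:

import pytorch_lightning as pl
import torch
import torch.nn.functional as F


class LRATask(pl.LightningModule):
    """Illustrative wrapper; the real module builds the xformers model from the config."""

    def __init__(self, config):
        super().__init__()
        self.config_training = config["training"]
        # Placeholder model; the actual code constructs the attention model from config["model"].
        self.model = torch.nn.Linear(config["model"]["dim"], config["model"]["num_classes"])

    def training_step(self, batch, batch_idx):
        inputs, labels = batch
        logits = self.model(inputs)
        loss = F.cross_entropy(logits, labels)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(
            self.parameters(), lr=self.config_training["learning_rate"]
        )

Training then goes through pl.Trainer(...).fit(module, dataloader) instead of a hand-rolled loop.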

Before submitting

  • Did you have fun?
    • Make sure you had fun coding 🙃
  • Did you read the contributor guideline?
  • Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
    • N/A
  • Did you make sure to update the docs?
    • N/A
  • Did you write any new necessary tests?
    • N/A
  • Did you update the changelog? (if needed)
    • N/A

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

lisjin and others added 4 commits June 28, 2022 09:25
* minor cleanup; updated changelog

* fixed mypy error

* added checking for blocksparse availability

Co-authored-by: Chris Yuan <christopheryuan@learnfair1490.h2.fair>
Co-authored-by: Chris Yuan <christopheryuan@devfair0278.h2.fair>
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 28, 2022
@lisjin (Contributor Author) commented Jun 28, 2022

@dianaml0 @blefaudeux Hope it's okay to tag you both as reviewers based on the original task (T104292598).

@dianaml0 dianaml0 self-requested a review June 28, 2022 18:02
@blefaudeux (Contributor) commented Jun 29, 2022

@dianaml0 @blefaudeux Hope it's okay to tag you both as reviewers based on the original task (T104292598).

Sounds great, thank you for working on this!


    return model


def build_training_setup(
Contributor:

oh wow, this is some sizeable cleanup.. thank you !


# Training epochs
if accumu_steps > 1:
    config_training["num_train_steps"] *= accumu_steps
Contributor:

I'm not sure this is still required with Lightning; it handles grad accumulation out of the box, right?

Contributor Author (lisjin):

I thought so too, but based on this warning, it sounds like gradient accumulation coupled with DDP behaves differently. I'll look into it in more detail.

Contributor:

This is fine; it just means that prior to the optimizer step the gradients will not be in sync across the fleet, whereas they are without accumulation. It's a useful warning if you were to peek into the gradients on a per-rank basis and decide on something from that, since triggering the grad accumulation could then mess with your logic. We're not doing that here, simply training over all the ranks, so the default Lightning behaviour should be fine.

Contributor Author (lisjin):

Ah, that makes sense, thanks! I'll get rid of this block then.
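For reference, Lightning exposes gradient accumulation directly on the Trainer, which is why the manual num_train_steps *= accumu_steps scaling can go away. A minimal sketch, with placeholder config values:

import pytorch_lightning as pl

config_training = {"num_train_steps": 10_000, "gradient_accumulation": 4}  # placeholders

trainer = pl.Trainer(
    max_steps=config_training["num_train_steps"],
    # Lightning only steps the optimizer every N batches, so the step count
    # does not need to be rescaled by hand.
    accumulate_grad_batches=config_training["gradient_accumulation"],
)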

@blefaudeux (Contributor) commented Jul 1, 2022

Looks good to me @lisjin, thank you for all this work!

Quick follow-up, sorry for the delay:

  • If possible, I would do a second pass with a critical eye, trying to comment/simplify/improve the overall code quality while you're at it (to be clear, I'm not criticizing this contribution, but rather the initial quality of this part of the codebase).
  • Would it also be possible to trigger a couple of jobs to check the results? This combined with that used to make it practical to generate the full LRA score matrix in one go on a Slurm cluster; it would be great if that still works and we can cross-check the results.

@codecov-commenter

Codecov Report

Merging #343 (a191fd3) into main (7fdb90d) will decrease coverage by 0.02%.
The diff coverage is 50.00%.

@@            Coverage Diff             @@
##             main     #343      +/-   ##
==========================================
- Coverage   93.91%   93.89%   -0.03%     
==========================================
  Files          70       70              
  Lines        3960     3962       +2     
==========================================
+ Hits         3719     3720       +1     
- Misses        241      242       +1     
Flag Coverage Δ
Python 93.89% <50.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
xformers/components/multi_head_dispatch.py 97.00% <50.00%> (-0.96%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7fdb90d...a191fd3. Read the comment docs.

@lisjin (Contributor Author) commented Jul 5, 2022

  • Would it also be possible to trigger a couple of jobs to check the results? This combined with that used to make it practical to generate the full LRA score matrix in one go on a Slurm cluster; it would be great if that still works and we can cross-check the results.

@blefaudeux I ran experiments for nystrom attention and the numbers look comparable to the ones reported in Nystromformer.

Task                                   test_accu_mean
retrieval                              0.6384
image                                  0.3919
pathfinder32-curv_baseline             0.8597
pathfinder32-curv_contour_length_9     0.8018
pathfinder32-curv_contour_length_14    0.6592
text                                   0.6222

@lisjin lisjin requested a review from blefaudeux July 5, 2022 18:43
@@ -277,7 +277,9 @@ def forward(

        # Apply the optional input masking
        if encoder_input_mask is not None:
            x += encoder_input_mask.unsqueeze(0).unsqueeze(-1)
            if x.dim() - encoder_input_mask.dim() > 1:
Contributor Author (lisjin):

I had to add this check to avoid a tensor shape mismatch error.
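
To illustrate the kind of check involved, here is a hedged sketch (not the verbatim patch): the mask may arrive with one or two fewer dimensions than x, so it only needs an extra leading dimension in the latter case before broadcasting over the embedding dimension.

import torch


def apply_input_mask(x: torch.Tensor, encoder_input_mask: torch.Tensor) -> torch.Tensor:
    # x is (batch, seq, emb); the mask can be (seq,) or (batch, seq).
    # Prepend a batch dimension only when the mask is more than one dim short of x,
    # then broadcast over the embedding dimension.
    if x.dim() - encoder_input_mask.dim() > 1:
        encoder_input_mask = encoder_input_mask.unsqueeze(0)
    return x + encoder_input_mask.unsqueeze(-1)


x = torch.randn(2, 16, 8)
assert apply_input_mask(x, torch.zeros(16)).shape == x.shape       # shared mask
assert apply_input_mask(x, torch.zeros(2, 16)).shape == x.shape    # per-sample mask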

Contributor:

oh, well done, thanks for fixing that

@blefaudeux blefaudeux requested a review from fmassa July 6, 2022 07:16
@blefaudeux (Contributor) left a comment:

LGTM, thanks a million @lisjin! I'll defer to Diana and Francisco for another validation/landing, but I think it's a lot cleaner indeed. Thanks for the validation runs as well; a great PR which was not trivial!

@dianaml0 (Contributor) left a comment:

Thanks so much for your contribution! Great to have this improvement and that the LRA results have been validated!

def __init__(self, config, model_name):
    super().__init__()

    config_model = config["model"]
    self.config_training = config["training"]

    self.enable_amp = config["training"]["mixed_precision"]
Contributor:

Seems like this is no longer being used?

Contributor Author (lisjin):

A little buried, but it's being used in configure_optimizers.
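
As a purely hypothetical illustration of how an enable_amp-style flag could surface in configure_optimizers, e.g. widening Adam's eps for fp16 stability (an assumption for this sketch, not the actual xformers code):

import pytorch_lightning as pl
import torch


class AmpAwareTask(pl.LightningModule):
    """Hypothetical sketch only."""

    def __init__(self, config_training):
        super().__init__()
        self.config_training = config_training
        self.enable_amp = config_training["mixed_precision"]

    def configure_optimizers(self):
        # Mixed-precision runs often prefer a larger eps so the Adam update
        # denominator stays representable in fp16 (illustrative choice).
        eps = 1e-4 if self.enable_amp else 1e-8
        return torch.optim.AdamW(
            self.parameters(),
            lr=self.config_training["learning_rate"],
            eps=eps,
        )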

@blefaudeux blefaudeux merged commit 769cfe3 into facebookresearch:main Jul 8, 2022