
[T104292598] Refactor the "LRA" training code -> Pytorch Lightning #343

Merged: 10 commits merged into facebookresearch:main on Jul 8, 2022
Conversation

@lisjin (Contributor) commented Jun 28, 2022

What does this PR do?

Refactors LRA's run_tasks.py to use PyTorch Lightning as the trainer.
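For context, this follows the standard PyTorch Lightning pattern of moving the model, loss, and optimizer setup into a LightningModule and letting the Trainer own the loop. A minimal sketch of that pattern; the class name, placeholder model, and config keys below are illustrative assumptions, not the exact code in this PR:

import pytorch_lightning as pl
import torch
import torch.nn.functional as F


class LRATask(pl.LightningModule):
    """Illustrative wrapper; the real module builds the xformers model from the config."""

    def __init__(self, config):
        super().__init__()
        self.config_training = config["training"]
        # Placeholder model; the actual code constructs the attention model from config["model"].
        self.model = torch.nn.Linear(config["model"]["dim"], config["model"]["num_classes"])

    def training_step(self, batch, batch_idx):
        inputs, labels = batch
        logits = self.model(inputs)
        loss = F.cross_entropy(logits, labels)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(
            self.parameters(), lr=self.config_training["learning_rate"]
        )

Training then goes through pl.Trainer(...).fit(module, dataloader) instead of a hand-rolled loop.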

Before submitting

  • Did you have fun?
    • Make sure you had fun coding 🙃
  • Did you read the contributor guideline?
  • Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
    • N/A
  • Did you make sure to update the docs?
    • N/A
  • Did you write any new necessary tests?
    • N/A
  • Did you update the changelog? (if needed)
    • N/A

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

lisjin and others added 4 commits June 28, 2022 09:25
* minor cleanup; updated changelog

* fixed mypy error

* added checking for blocksparse availability

Co-authored-by: Chris Yuan <christopheryuan@learnfair1490.h2.fair>
Co-authored-by: Chris Yuan <christopheryuan@devfair0278.h2.fair>
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 28, 2022
@lisjin (Contributor Author) commented Jun 28, 2022

@dianaml0 @blefaudeux Hope it's okay to tag you both as reviewers based on the original task (T104292598).

@dianaml0 dianaml0 self-requested a review June 28, 2022 18:02
@blefaudeux (Contributor) commented Jun 29, 2022

@dianaml0 @blefaudeux Hope it's okay to tag you both as reviewers based on the original task (T104292598).

Sounds great, thank you for working on this!


    return model


def build_training_setup(
Contributor:

oh wow, this is some sizeable cleanup.. thank you !


# Training epochs
if accumu_steps > 1:
    config_training["num_train_steps"] *= accumu_steps
Contributor:

I'm not sure this is still required with Lightning; it handles grad accumulation out of the box, right?

Contributor Author (lisjin):

I thought so too, but based on this warning, it sounds like gradient accumulation coupled with DDP behaves differently. I'll look into it in more detail.

Contributor:

This is fine; it just means that prior to the optimizer step the gradients will not be in sync across the fleet, whereas they are without accumulation. It's a useful warning if you were to peek into the gradients on a per-rank basis and decide on something from that, since triggering the grad accumulation could then mess with your logic. We're not doing that here, simply training over all the ranks, so the default Lightning behaviour should be fine.

Contributor Author (lisjin):

Ah, that makes sense, thanks! I'll get rid of this block then.
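For reference, Lightning exposes gradient accumulation directly on the Trainer, which is why the manual num_train_steps *= accumu_steps scaling can go away. A minimal sketch, with placeholder config values:

import pytorch_lightning as pl

config_training = {"num_train_steps": 10_000, "gradient_accumulation": 4}  # placeholders

trainer = pl.Trainer(
    max_steps=config_training["num_train_steps"],
    # Lightning only steps the optimizer every N batches, so the step count
    # does not need to be rescaled by hand.
    accumulate_grad_batches=config_training["gradient_accumulation"],
)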

@blefaudeux (Contributor) commented Jul 1, 2022

Looks good to me @lisjin, thank you for all this work!

Quick follow-up, sorry for the delay:

  • If possible, I would do a second pass with a critical eye, trying to comment/simplify/improve the overall code quality while you're at it (to be clear, I'm not criticizing this contribution, but rather the initial quality of this part of the codebase).
  • Would it also be possible to trigger a couple of jobs to check the results? This combined with that used to make it practical to generate the full LRA score matrix in one go on a Slurm cluster; it would be great if that still works and we can cross-check the results.

@codecov-commenter

Codecov Report

Merging #343 (a191fd3) into main (7fdb90d) will decrease coverage by 0.02%.
The diff coverage is 50.00%.

@@            Coverage Diff             @@
##             main     #343      +/-   ##
==========================================
- Coverage   93.91%   93.89%   -0.03%     
==========================================
  Files          70       70              
  Lines        3960     3962       +2     
==========================================
+ Hits         3719     3720       +1     
- Misses        241      242       +1     
Flag Coverage Δ
Python 93.89% <50.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
xformers/components/multi_head_dispatch.py 97.00% <50.00%> (-0.96%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7fdb90d...a191fd3. Read the comment docs.

@lisjin (Contributor Author) commented Jul 5, 2022

  • Would it also be possible to trigger a couple of jobs to check the results? This combined with that used to make it practical to generate the full LRA score matrix in one go on a Slurm cluster; it would be great if that still works and we can cross-check the results.

@blefaudeux I ran experiments for nystrom attention and the numbers look comparable to the ones reported in Nystromformer.

Task                                   test_accu_mean
retrieval                              0.6384
image                                  0.3919
pathfinder32-curv_baseline             0.8597
pathfinder32-curv_contour_length_9     0.8018
pathfinder32-curv_contour_length_14    0.6592
text                                   0.6222

@lisjin lisjin requested a review from blefaudeux July 5, 2022 18:43
@@ -277,7 +277,9 @@ def forward(

        # Apply the optional input masking
        if encoder_input_mask is not None:
            x += encoder_input_mask.unsqueeze(0).unsqueeze(-1)
            if x.dim() - encoder_input_mask.dim() > 1:
Contributor Author (lisjin):

I had to add this check to avoid a tensor shape mismatch error.
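
To illustrate the kind of check involved, here is a hedged sketch (not the verbatim patch): the mask may arrive with one or two fewer dimensions than x, so it only needs an extra leading dimension in the latter case before broadcasting over the embedding dimension.

import torch


def apply_input_mask(x: torch.Tensor, encoder_input_mask: torch.Tensor) -> torch.Tensor:
    # x is (batch, seq, emb); the mask can be (seq,) or (batch, seq).
    # Prepend a batch dimension only when the mask is more than one dim short of x,
    # then broadcast over the embedding dimension.
    if x.dim() - encoder_input_mask.dim() > 1:
        encoder_input_mask = encoder_input_mask.unsqueeze(0)
    return x + encoder_input_mask.unsqueeze(-1)


x = torch.randn(2, 16, 8)
assert apply_input_mask(x, torch.zeros(16)).shape == x.shape       # shared mask
assert apply_input_mask(x, torch.zeros(2, 16)).shape == x.shape    # per-sample mask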

Contributor:

oh, well done, thanks for fixing that

@blefaudeux blefaudeux requested a review from fmassa July 6, 2022 07:16
@blefaudeux (Contributor) left a comment:

LGTM, thanks a million @lisjin! I'll defer to Diana and Francisco for another validation/landing, but I think it's a lot cleaner indeed. Thanks for the validation runs as well; a great PR which was not trivial!

@dianaml0 (Contributor) left a comment:

Thanks so much for your contribution! Great to have this improvement and that the LRA results have been validated!

def __init__(self, config, model_name):
    super().__init__()

    config_model = config["model"]
    self.config_training = config["training"]

    self.enable_amp = config["training"]["mixed_precision"]
Contributor:

Seems like this is no longer being used?

Contributor Author (lisjin):

A little buried, but it's being used in configure_optimizers.
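
As a purely hypothetical illustration of how an enable_amp-style flag could surface in configure_optimizers, e.g. widening Adam's eps for fp16 stability (an assumption for this sketch, not the actual xformers code):

import pytorch_lightning as pl
import torch


class AmpAwareTask(pl.LightningModule):
    """Hypothetical sketch only."""

    def __init__(self, config_training):
        super().__init__()
        self.config_training = config_training
        self.enable_amp = config_training["mixed_precision"]

    def configure_optimizers(self):
        # Mixed-precision runs often prefer a larger eps so the Adam update
        # denominator stays representable in fp16 (illustrative choice).
        eps = 1e-4 if self.enable_amp else 1e-8
        return torch.optim.AdamW(
            self.parameters(),
            lr=self.config_training["learning_rate"],
            eps=eps,
        )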

@blefaudeux blefaudeux merged commit 769cfe3 into facebookresearch:main Jul 8, 2022