
Fix TPU testing and collect all tests #11098

Merged · 18 commits merged into master on Jul 27, 2022
Conversation

@awaelchli awaelchli (Contributor) commented Dec 16, 2021

What does this PR do?

Fixes #13720

Addresses a range of issues with the TPU CI:

  • Now collects all tests marked with RunIf(tpu=True). Tests are no longer hardcoded into the CI script and forgotten, which previously led to tests being added that never ran.
  • Removed the @pl_multi_process decorator from all tests. This decorator suppressed exceptions and assertion errors, so some tests had been broken and outdated for a while without ever surfacing their failures.
  • Added a standalone marker, RunIf(tpu=True, standalone=True), for TPU tests as an alternative to the aforementioned pl_multi_process. The CI now runs standalone tests similarly to the GPU test suite. This is necessary, for example, when we need the single-device TPU strategy, which requires accessing the XLA device in the main process.
  • Fixed the tests that were broken by the failure causes described above.
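The marker pattern described above can be sketched roughly as follows. This is an illustrative mock, not the actual Lightning implementation: the `_tpu_available` helper and the `PL_RUN_STANDALONE_TESTS` / `TPU_AVAILABLE` environment variables are assumptions chosen for the example, and the real `RunIf` supports many more requirement flags.

```python
import os

import pytest


def _tpu_available() -> bool:
    # Hypothetical availability check; a real suite would query torch_xla.
    return os.environ.get("TPU_AVAILABLE", "0") == "1"


class RunIf:
    """Sketch of a RunIf-style requirement marker: builds a
    ``pytest.mark.skipif`` from requirement flags so each test declares
    what it needs, and the collector finds all such tests automatically."""

    def __new__(cls, tpu: bool = False, standalone: bool = False):
        conditions = []
        reasons = []
        if tpu:
            conditions.append(not _tpu_available())
            reasons.append("TPU")
        if standalone:
            # Standalone tests run only when the CI launcher opts in,
            # mirroring how a standalone GPU suite is typically gated.
            conditions.append(
                os.environ.get("PL_RUN_STANDALONE_TESTS", "0") != "1"
            )
            reasons.append("standalone execution")
        return pytest.mark.skipif(
            condition=any(conditions),
            reason=f"Requires: [{' + '.join(reasons)}]",
        )


@RunIf(tpu=True, standalone=True)
def test_single_device_tpu_strategy():
    ...  # would touch the XLA device in the main process
```

Because the marker reduces to a plain `skipif`, unmet requirements skip the test with a self-describing reason instead of silently omitting it from collection.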

Note: after resolving the core issues and pushing many commits to exercise the CI, I have not seen any flakiness or random behavior anymore.

In a follow-up, we should update the CI to the latest torch_xla version.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

I made sure I had fun coding 🙃

Part of #1 (it's a lie, this is just here to avoid noisy GitHub bot)

cc @Borda @tchaton @rohitgr7 @akihironitta @carmocca @kaushikb11

@awaelchli awaelchli added the "accelerator: tpu" (Tensor Processing Unit), "bug" (Something isn't working), and "ci" (Continuous Integration) labels Dec 17, 2021
@awaelchli awaelchli added this to the 1.5.x milestone Dec 17, 2021
@awaelchli awaelchli added the "priority: 0" (High priority task) label Dec 17, 2021
@awaelchli awaelchli mentioned this pull request Dec 17, 2021
tests/lite/test_lite.py — review thread (outdated, resolved)
@Borda Borda (Member) commented Dec 18, 2021

I think that in the past we ran only a selection of files to keep the CI fast, but I agree this is the more robust solution: we test all of PL and no longer need to remember to register each file somewhere...

@mergify mergify bot added the "ready" (PRs ready to be merged) label Dec 18, 2021
tests/helpers/runif.py — review thread (outdated, resolved)
@kaushikb11 kaushikb11 mentioned this pull request Dec 20, 2021
@awaelchli awaelchli (Contributor, Author) commented:

The TPU CI is still getting stuck or timing out. One of the newly discovered tests must be the cause.

@tchaton tchaton (Contributor) commented Jan 4, 2022

> TPU ci is still getting stuck/ timing out. One of the newly discovered tests must be the cause of this.

Any progress on this? Did you identify the hanging test?

@mergify mergify bot added the "has conflicts" label Jan 4, 2022
@kaushikb11 kaushikb11 (Contributor) commented:

> Any progress on this? Did you identify the hanging test?

I will take a stab at it today.

@mergify mergify bot removed the "ready" (PRs ready to be merged) label Jul 25, 2022
@mergify mergify bot added the "ready" label and removed the "has conflicts" label Jul 25, 2022
@mergify mergify bot added the "has conflicts" label and removed the "ready" label Jul 27, 2022
@kaushikb11 kaushikb11 enabled auto-merge (squash) Jul 27, 2022 15:07
@mergify mergify bot added the "ready" label and removed the "has conflicts" label Jul 27, 2022
@kaushikb11 kaushikb11 merged commit fff62f0 into master Jul 27, 2022
@kaushikb11 kaushikb11 deleted the ci/run-all-tpu-tests branch July 27, 2022 15:40
@awaelchli awaelchli mentioned this pull request Aug 5, 2022
Labels: accelerator: tpu (Tensor Processing Unit), bug (Something isn't working), ci (Continuous Integration), pl (Generic label for PyTorch Lightning package), priority: 0 (High priority task), ready (PRs ready to be merged)
Successfully merging this pull request may close: The TPU issues in Lightning
6 participants