
Build Wheels with NOVA #573

Closed · wants to merge 1 commit

Conversation

@AbdBarho (Contributor) commented Dec 9, 2022

What does this PR do?

Prototype building wheels with NOVA

Related to #533

NOVA complications

  • The default build matrix contains all accelerators; we only build for cuda, so we need an additional step to filter out what we don't need.
  • The default runners are GitHub Enterprise runners, which is fine if you're Facebook but not if you're an open-source contributor. I had to build a way to override the runners.
    • It might be that timeouts / OOMs won't happen with the default enterprise runners, but I cannot test this.
  • I could not find an easier way of defining environment variables for the job other than writing to BUILD_ENV_FILE, which seems to be an internal thing.
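The filtering step from the first bullet could be sketched roughly like this. This is only an illustration: the matrix shape and the `gpu_arch_type` / `desired_cuda` field names are assumptions about what the generator emits, not confirmed Nova output, and it assumes `jq` is available on the runner.

```shell
# Hypothetical generated matrix (shape and field names are assumptions).
MATRIX='{"include":[{"gpu_arch_type":"cpu"},{"gpu_arch_type":"cuda","desired_cuda":"cu117"},{"gpu_arch_type":"rocm"}]}'

# Keep only the CUDA entries; downstream build jobs would consume FILTERED.
FILTERED=$(echo "$MATRIX" | jq -c '{include: [.include[] | select(.gpu_arch_type == "cuda")]}')
echo "$FILTERED"
```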

Current state

The attached workflow has multiple problems that I don't know how to solve:

  • The Linux build fails because of a timeout. timeout source, error log
  • The Windows build fails because our setup.py cannot detect nvcc. I am unsure whether we need to install the CUDA toolkit separately on Windows, since this was not necessary on Linux. workflow file, error log

I am open to suggestions.

@facebook-github-bot added the CLA Signed label Dec 9, 2022
@danthe3rd (Contributor)

Thanks a lot! I've pinged a few folks here for help on the issues you identified.

@kit1980 (Contributor) commented Dec 9, 2022

The nvcc error occurs because the machine doesn't have any CUDA-related software installed: https://github.com/actions/runner-images/blob/win22/20221204.3/images/win/Windows2022-Readme.md

@osalpekar

Hi @AbdBarho, thanks for your feedback!

Quick thoughts:

  • I will increase the build timeout to allow your builds to go through. I hope this is enough to unblock you on the Linux side.
  • @kit1980 pointed out that due to the runner-type arg passed to your workflow, it is not using a GPU-enabled Windows machine, which is why nvcc cannot be found. At the moment, Linux/Mac/M1 builds are more stable than Windows builds (I have updated the wiki to reflect that). If there are other issues with Windows that we can't resolve quickly enough, I propose we migrate Linux first and then Windows a bit later.

Regarding the Nova complications:

  • Re only needing certain accelerators - this is a great point. Let me try to add some args so that you can disable rocm and cpu builds without writing a whole new workflow to filter these
  • Re runners - Linux/Windows jobs run on self-hosted AWS runners. Mac jobs run on GitHub hosted runners. Is there any issue with this setup?
  • Re env vars - Yes this is a gap. Let me try to provide an easier way of passing domain-specific env vars
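For reference, the BUILD_ENV_FILE workaround being discussed looks roughly like the sketch below. BUILD_ENV_FILE is internal to Nova, so the exact semantics are an assumption here: the guess is that it is a plain KEY=VALUE file that the build step later sources. In Nova the path is provided by the workflow; the sketch creates a temp file when it is unset.

```shell
# Assumed workaround: append KEY=VALUE lines to BUILD_ENV_FILE, which the
# build step (presumably) sources later. Nova normally provides this path.
BUILD_ENV_FILE="${BUILD_ENV_FILE:-$(mktemp)}"
echo "TORCH_CUDA_ARCH_LIST=8.0" >> "${BUILD_ENV_FILE}"
echo "MAX_JOBS=4" >> "${BUILD_ENV_FILE}"
```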

@AbdBarho (Contributor, Author) commented Dec 9, 2022

@kit1980 @osalpekar Thanks for the quick responses!

Situation

I will try to summarise the situation just to make sure we all have a common understanding:

Perhaps the most critical point that complicates things is that I am not part of the meta / pytorch team, and I don't have the rights to run any jobs on your self-hosted runners.

For lack of a better solution, I overrode the runner suggestions in the build matrix of Nova to use the free GitHub ones, which are of course much slower, and are probably the reason why the Linux workflow is timing out.

As for Windows, I assume your self-hosted GPU runners already ship with cuda, which would solve the problem. But again, as @kit1980 mentioned, there is no cuda in the GitHub Windows runner.
(I assumed some part of the workflow would set it up, but that was not the case. I did not get an error on Linux because cuda is part of the container, not the runner's VM image.)
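A cheap way to surface this class of failure earlier would be a sanity-check step before the build. This is a sketch, not part of the current workflow; it only warns rather than failing, since a CPU-only fallback build may sometimes be intentional.

```shell
# Sketch: detect early whether nvcc is on PATH, instead of letting
# setup.py fail (or silently fall back) deep into the build.
has_cmd() { command -v "$1" >/dev/null 2>&1; }

if has_cmd nvcc; then
  nvcc --version
else
  echo "WARNING: nvcc not found on this runner; CUDA build will fail" >&2
fi
```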

Of course, we can drop the windows build for a first version.

Workarounds

@osalpekar thank you for proactively fixing the timeout issue in pytorch/test-infra#1267. I do think this will help in the short run, though I am unsure for how long: I have seen xformers take up to 3 hours on the free GitHub runners when building for many cuda architectures. A more powerful runner might therefore still be necessary.
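One hedge against hitting the timeout, if it is acceptable for the published wheels, would be trimming the list of target CUDA architectures. The sketch below uses PyTorch's TORCH_CUDA_ARCH_LIST convention; the specific values and the MAX_JOBS bound are examples, not xformers defaults.

```shell
# Example only: build for fewer CUDA architectures to cut compile time,
# and bound parallel compiler jobs to avoid OOM on small runners.
export TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6"
export MAX_JOBS=2
# the build step would then run, e.g.:
#   python setup.py bdist_wheel
```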

I see multiple paths going forward:

  • I hand off development of this feature to someone on the team who has access to the self-hosted runners.
  • I get the rights to use the self-hosted runners.
  • We re-think the solution around the standard GitHub runners, partially in this repo but also in pytorch/test-infra.

There are probably some other solutions that did not come to mind, so I would really like to hear your opinions and ideas!

Thanks again @osalpekar for understanding my pain points with Nova; I am willing to contribute if needed (assuming I don't need access to any self-hosted runners, otherwise we are back to square one 😅).

@osalpekar

@AbdBarho Thanks for the added context. I will see what I can do regarding the self-hosted runners access. If you can provide me the exact errors you saw when trying to run on those, it will be very helpful for me. In the meantime, here is what I propose for this PR:

  1. Let's limit the scope to Linux for now. GHA not having Windows runners with GPUs seems to be a hard blocker for Windows builds.
  2. Let's leverage the changes in "Support for Users to provide their own Env Vars to Build Pipelines" (pytorch/test-infra#1269) and "Larger timeout for some use cases" (pytorch/test-infra#1267) to get as close as we can to a successful Linux build on the GitHub runners.
  3. If you need a larger timeout, you can create a PR on test-infra changing the time limit, and when calling the base workflow, call `build_wheels_linux.yml@<your_branch_on_test_infra>` instead of `build_wheels_linux.yml@main`.
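In the caller workflow, the branch override from step 3 might look roughly like this (a sketch; `<your_fork>` and `<your_branch>` are placeholders, and the job/input names are illustrative, not the exact Nova interface):

```yaml
jobs:
  build:
    # normal case: reference the upstream reusable workflow
    uses: pytorch/test-infra/.github/workflows/build_wheels_linux.yml@main
    # while iterating on a larger timeout, point at a fork instead:
    # uses: <your_fork>/test-infra/.github/workflows/build_wheels_linux.yml@<your_branch>
```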

I hope this enables you to get as close as possible to completing this work with the GitHub-hosted runners. We can check this PR in, and somebody else can do the last-mile work of getting this working on our self-hosted runners. Let me know what you think!

@AbdBarho (Contributor, Author)

@osalpekar Sounds like a plan!

Regarding the error message I was seeing with the self-hosted runners, you can find a failed workflow here:
https://github.com/AbdBarho/xformers-wheels/actions/runs/3651668654

@danthe3rd (Contributor)

> The nvcc error occurs because the machine doesn't have any CUDA-related software installed: https://github.com/actions/runner-images/blob/win22/20221204.3/images/win/Windows2022-Readme.md

Shouldn't we install NVCC if it's not already available? I don't think we have open-source runners on Windows with GPUs (and we can totally build for GPUs on a CPU-only instance).
For reference, this is the setup that @AbdBarho did for Windows. It also includes a pagefile hack to increase available virtual memory and avoid an OOM. Could we have something similar in the default NOVA action?

```yaml
- name: (Windows) install cuda
  uses: okazunori2013/cuda-toolkit@v0.3.2
  with:
    cuda: "11.7.0"
    method: network
- name: (Windows) install python
  uses: actions/setup-python@v4
  with:
    python-version: "3.7"
- name: (Windows) setup msvc
  uses: ilammy/msvc-dev-cmd@v1
- name: configure Pagefile
  uses: al-cheb/configure-pagefile-action@v1.2
  with:
    minimum-size: 8GB
# really unfortunate: https://github.com/ilammy/msvc-dev-cmd#name-conflicts-with-shell-bash
- name: (Windows) Remove link.exe
  run: rm /usr/bin/link
```

@AbdBarho (Contributor, Author)

@danthe3rd I think I found a solution that works for both of us without having to change much.

I have forked the infra repo and made all the changes necessary to be able to build the wheels without the self-hosted runners.

The idea is: when I am developing I reference my fork, but here we reference the main infra repo.

I believe that we won't need the timeout increase in pytorch/test-infra#1267; the self-hosted runners are that good.

You can see all the changes here: https://github.com/AbdBarho/test-infra/pull/1/files
And an example workflow run here: https://github.com/AbdBarho/xformers-wheels/actions/runs/3674429499/workflow

@danthe3rd Can you run this workflow, ideally before merging the branch? That would show us whether my theory is correct: you will get errors when building for cpu or rocm, but cuda should work.

@danthe3rd (Contributor)

Opened #580 to test this

> I believe that we won't need the timeout increase in pytorch/test-infra#1267; the self-hosted runners are that good.

I'm not sure we have access to pytorch's self-hosted runners, as this repo is under a different GitHub organization.

@danthe3rd (Contributor)

All the GitHub actions are stuck with:

```
Requested labels: linux.4xlarge.nvidia.gpu
Job defined at: pytorch/test-infra/.github/workflows/build_wheels_linux.yml@refs/heads/main
Waiting for a runner to pick up this job...
```

It looks like we don't have access to those runners :/

@AbdBarho (Contributor, Author)

bummer...

Is there a plan to be part of the pytorch organization or get access to the runners anytime soon?

In any case, we could in theory continue using my fork of the infra repo with the standard runners, but I don't see much added value in that approach.

If we are going to maintain the CI code ourselves anyway, we might as well stick with the workflow we already have here and put Nova aside for the time being.

What do you think?

@AbdBarho (Contributor, Author)

@danthe3rd I am going to close this for now.

@AbdBarho AbdBarho closed this Dec 13, 2022
@AbdBarho AbdBarho mentioned this pull request Dec 13, 2022