
Build Wheels with NOVA #573

Closed · wants to merge 1 commit

Conversation

@AbdBarho (Contributor) commented Dec 9, 2022

What does this PR do?

Prototype building wheels with NOVA

Related to #533

NOVA complications

  • The default build matrix contains all accelerators; we only build for cuda, so we need an additional step to filter out what we don't need.
  • The default runners are GitHub Enterprise runners, which is fine if you're Facebook but not if you're an open-source contributor. I had to build a way to override the runners.
    • It might be that timeouts / OOMs won't happen with the default enterprise runners, but I cannot test this.
  • I could not find an easier way of defining environment variables for the job other than writing to BUILD_ENV_FILE, which seems to be an internal thing.
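The filtering step from the first bullet could be sketched roughly like this. This is only an illustration: the matrix shape and the `gpu_arch_type` / `desired_cuda` field names are assumptions about what the generator emits, not confirmed Nova output, and it assumes `jq` is available on the runner.

```shell
# Hypothetical generated matrix (shape and field names are assumptions).
MATRIX='{"include":[{"gpu_arch_type":"cpu"},{"gpu_arch_type":"cuda","desired_cuda":"cu117"},{"gpu_arch_type":"rocm"}]}'

# Keep only the CUDA entries; downstream build jobs would consume FILTERED.
FILTERED=$(echo "$MATRIX" | jq -c '{include: [.include[] | select(.gpu_arch_type == "cuda")]}')
echo "$FILTERED"
```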

Current state

The attached workflow has multiple problems that I don't know how to solve:

  • The Linux build fails because of a timeout. timeout source, error log
  • The Windows build fails because our setup.py cannot detect nvcc. I am unsure whether we need to install the CUDA toolkit separately on Windows, since this was not necessary on Linux. workflow file, error log

I am open to suggestions.

@facebook-github-bot added the CLA Signed label Dec 9, 2022
@danthe3rd (Contributor)

Thanks a lot! I've pinged a few folks here for help on the issues you identified.

@kit1980 (Contributor) commented Dec 9, 2022

The nvcc error occurs because the machine doesn't have any CUDA-related software installed: https://github.com/actions/runner-images/blob/win22/20221204.3/images/win/Windows2022-Readme.md

@osalpekar

Hi @AbdBarho, thanks for your feedback!

Quick thoughts:

  • I will increase the build timeout to allow your builds to go through. I hope this is enough to unblock you on the Linux side.
  • @kit1980 pointed out that due to the runner-type arg passed to your workflow, it is not using a GPU-enabled Windows machine, which is why nvcc cannot be found. At the moment, Linux/Mac/M1 builds are more stable than Windows builds (I have updated the wiki to reflect that). If there are other issues with Windows that we can't resolve quickly enough, I propose we migrate Linux first and then Windows a bit later.

Regarding the Nova complications:

  • Re only needing certain accelerators - this is a great point. Let me try to add some args so that you can disable rocm and cpu builds without writing a whole new workflow to filter these
  • Re runners - Linux/Windows jobs run on self-hosted AWS runners. Mac jobs run on GitHub hosted runners. Is there any issue with this setup?
  • Re env vars - Yes this is a gap. Let me try to provide an easier way of passing domain-specific env vars
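For reference, the BUILD_ENV_FILE workaround being discussed looks roughly like the sketch below. BUILD_ENV_FILE is internal to Nova, so the exact semantics are an assumption here: the guess is that it is a plain KEY=VALUE file that the build step later sources. In Nova the path is provided by the workflow; the sketch creates a temp file when it is unset.

```shell
# Assumed workaround: append KEY=VALUE lines to BUILD_ENV_FILE, which the
# build step (presumably) sources later. Nova normally provides this path.
BUILD_ENV_FILE="${BUILD_ENV_FILE:-$(mktemp)}"
echo "TORCH_CUDA_ARCH_LIST=8.0" >> "${BUILD_ENV_FILE}"
echo "MAX_JOBS=4" >> "${BUILD_ENV_FILE}"
```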

@AbdBarho (Contributor, Author) commented Dec 9, 2022

@kit1980 @osalpekar Thanks for the quick responses!

Situation

I will try to summarise the situation just to make sure we all have a common understanding:

Perhaps the most critical point that complicates things is that I am not part of the meta / pytorch team, and I don't have the rights to run any jobs on your self-hosted runners.

For lack of a better solution, I overrode the runner suggestions in the build matrix of Nova to use the free GitHub ones, which are of course much slower, and are probably the reason why the Linux workflow is timing out.

As for Windows, I assume your self-hosted GPU runners already ship with cuda, which would solve the problem. But again, as @kit1980 mentioned, there is no cuda in the GitHub Windows runner.
(I assumed some part of the workflow would set it up, but that was not the case. I did not get an error on Linux because cuda is part of the container, not the runner's VM image.)
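A cheap way to surface this class of failure earlier would be a sanity-check step before the build. This is a sketch, not part of the current workflow; it only warns rather than failing, since a CPU-only fallback build may sometimes be intentional.

```shell
# Sketch: detect early whether nvcc is on PATH, instead of letting
# setup.py fail (or silently fall back) deep into the build.
has_cmd() { command -v "$1" >/dev/null 2>&1; }

if has_cmd nvcc; then
  nvcc --version
else
  echo "WARNING: nvcc not found on this runner; CUDA build will fail" >&2
fi
```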

Of course, we can drop the windows build for a first version.

Workarounds

@osalpekar thank you for proactively fixing the timeout issue in pytorch/test-infra#1267. I do think this will help in the short run, though I am unsure for how long: I have seen xformers take up to 3 hours on the free GitHub runners when building for many cuda architectures. A more powerful runner might therefore still be necessary.
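One hedge against hitting the timeout, if it is acceptable for the published wheels, would be trimming the list of target CUDA architectures. The sketch below uses PyTorch's TORCH_CUDA_ARCH_LIST convention; the specific values and the MAX_JOBS bound are examples, not xformers defaults.

```shell
# Example only: build for fewer CUDA architectures to cut compile time,
# and bound parallel compiler jobs to avoid OOM on small runners.
export TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6"
export MAX_JOBS=2
# the build step would then run, e.g.:
#   python setup.py bdist_wheel
```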

I see multiple paths going forward:

  • I hand off development of this feature to someone on the team who has access to the self-hosted runners.
  • I get the rights to use the self-hosted runners.
  • We re-think the solution around the standard GitHub runners, partially in this repo but also in pytorch/test-infra.

There are probably some other solutions that did not come to mind, so I would really like to hear your opinions and ideas!

Thanks again @osalpekar for understanding my pain points with Nova; I am willing to contribute if needed (assuming I don't need access to any self-hosted runners, otherwise we are back to square one 😅).

@osalpekar

@AbdBarho Thanks for the added context. I will see what I can do regarding the self-hosted runners access. If you can provide me the exact errors you saw when trying to run on those, it will be very helpful for me. In the meantime, here is what I propose for this PR:

  1. Let's limit the scope to Linux for now. GHA not having Windows runners with GPUs seems to be a hard blocker for Windows builds.
  2. Let's leverage the changes in "Support for Users to provide their own Env Vars to Build Pipelines" (pytorch/test-infra#1269) and "Larger timeout for some use cases" (pytorch/test-infra#1267) to get as close as we can to a successful Linux build on the GitHub runners.
  3. If you need a larger timeout, you can create a PR on test-infra changing the time limit, and when calling the base workflow, call `build_wheels_linux.yml@<your_branch_on_test_infra>` instead of `build_wheels_linux.yml@main`.
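In the caller workflow, the branch override from step 3 might look roughly like this (a sketch; `<your_fork>` and `<your_branch>` are placeholders, and the job/input names are illustrative, not the exact Nova interface):

```yaml
jobs:
  build:
    # normal case: reference the upstream reusable workflow
    uses: pytorch/test-infra/.github/workflows/build_wheels_linux.yml@main
    # while iterating on a larger timeout, point at a fork instead:
    # uses: <your_fork>/test-infra/.github/workflows/build_wheels_linux.yml@<your_branch>
```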

I hope this enables you to get as close as possible to completing this work with the GitHub-hosted runners. We can check this PR in, and somebody else can do the last-mile work of getting this working on our self-hosted runners. Let me know what you think!

@AbdBarho (Contributor, Author)

@osalpekar Sounds like a plan!

Regarding the error message I was seeing with the self-hosted runners, you can find a failed workflow here:
https://github.com/AbdBarho/xformers-wheels/actions/runs/3651668654

@danthe3rd (Contributor)

> The nvcc error occurs because the machine doesn't have any CUDA-related software installed: https://github.com/actions/runner-images/blob/win22/20221204.3/images/win/Windows2022-Readme.md

Shouldn't we install NVCC if it's not already available? I don't think we have open-source runners on Windows with GPUs (and we can totally build for GPUs on a CPU-only instance).
For reference, this is the setup that @AbdBarho did for Windows. It also includes a pagefile hack to increase available virtual memory and avoid an OOM. Could we have something similar in the default NOVA action?

```yaml
- name: (Windows) install cuda
  uses: okazunori2013/cuda-toolkit@v0.3.2
  with:
    cuda: "11.7.0"
    method: network
- name: (Windows) install python
  uses: actions/setup-python@v4
  with:
    python-version: "3.7"
- name: (Windows) setup msvc
  uses: ilammy/msvc-dev-cmd@v1
- name: configure Pagefile
  uses: al-cheb/configure-pagefile-action@v1.2
  with:
    minimum-size: 8GB
# really unfortunate: https://github.com/ilammy/msvc-dev-cmd#name-conflicts-with-shell-bash
- name: (Windows) Remove link.exe
  run: rm /usr/bin/link
```

@AbdBarho (Contributor, Author)

@danthe3rd I think I found a solution that works for both of us without having to change much.

I have forked the infra repo and made all the changes necessary to be able to build the wheels without the self-hosted runners.

The idea is: when I am developing I reference my fork, but here we reference the main infra repo.

I believe that we won't need the timeout increase in pytorch/test-infra#1267; the self-hosted runners are that good.

You can see all the changes here: https://github.com/AbdBarho/test-infra/pull/1/files
And an example workflow run here: https://github.com/AbdBarho/xformers-wheels/actions/runs/3674429499/workflow

@danthe3rd Can you run this workflow, ideally before merging the branch? That would show us whether my theory is correct: you will get errors when building for cpu or rocm, but cuda should work.

@danthe3rd (Contributor)

Opened #580 to test this

> I believe that we won't need the timeout increase in pytorch/test-infra#1267; the self-hosted runners are that good.

I'm not sure we have access to pytorch's self-hosted runners, as this repo is under a different GitHub organization.

@danthe3rd (Contributor)

All the GitHub actions are stuck with:

```
Requested labels: linux.4xlarge.nvidia.gpu
Job defined at: pytorch/test-infra/.github/workflows/build_wheels_linux.yml@refs/heads/main
Waiting for a runner to pick up this job...
```

It looks like we don't have access to those runners :/

@AbdBarho (Contributor, Author)

bummer...

Is there a plan to be part of the pytorch organization or get access to the runners anytime soon?

In any case, we could in theory continue using my fork of the infra repo with the standard runners, but I don't see much added value in that approach.

If we are going to maintain the CI code ourselves anyway, we might as well stick with the workflow we already have here and put Nova aside for the time being.

What do you think?

@AbdBarho (Contributor, Author)

@danthe3rd I am going to close this for now.

@AbdBarho AbdBarho closed this Dec 13, 2022
@AbdBarho AbdBarho mentioned this pull request Dec 13, 2022