Build Wheels with NOVA #573
Thanks a lot! I've pinged a few folks here for help on the issues you identified.
The nvcc error occurs because the machine doesn't have any CUDA tooling installed: https://github.com/actions/runner-images/blob/win22/20221204.3/images/win/Windows2022-Readme.md
Hi @AbdBarho, thanks for your feedback! Quick thoughts:
Regarding the Nova complications:
@kit1980 @osalpekar Thanks for the quick responses!

Situation

I will try to summarise the situation to make sure we all have a common understanding:

Perhaps the most critical point that complicates things is that I am not part of the Meta / PyTorch team, and I don't have the rights to run any jobs on your self-hosted runners. For lack of a better solution, I overrode the runner suggestions in the Nova build matrix to use the free GitHub-hosted ones, which are of course much slower and are probably the reason the Linux workflow is timing out.

As for Windows, I assume your self-hosted GPU runners already ship with CUDA, which would solve the problem. But again, as @kit1980 mentioned, there is no CUDA on the GitHub Windows runner. Of course, we can drop the Windows build for a first version.

Workarounds

@osalpekar thank you for being proactive in fixing the timeout issue in pytorch/test-infra#1267. I do think this would help in the short run, but I am unsure for how long. I have seen xformers take up to 3 hours on the free GitHub runners when building for many CUDA architectures, so I believe that a powerful runner might still be necessary.

I see multiple paths going forward:
There are probably some other solutions that did not come to mind, so I would really like to hear your opinions and ideas!

Thanks again @osalpekar for understanding my pain points with Nova. I am willing to contribute if needed (assuming I don't need access to any self-hosted runners, otherwise we are back to square one 😅).
@AbdBarho Thanks for the added context. I will see what I can do regarding the self-hosted runners access. If you can provide me the exact errors you saw when trying to run on those, it will be very helpful for me. In the meantime, here is what I propose for this PR:
I hope this enables you to get as close as possible to completing this work with the GitHub-hosted runners. We can check this PR in, and somebody else can do the last-mile work of getting this working on our self-hosted runners. Let me know what you think!
@osalpekar Sounds like a plan! Regarding the error message I was seeing with the self-hosted runners, you can find a failed workflow here:
Shouldn't we install NVCC if it's not already available? I don't think we have open-source runners on Windows with GPUs (and we can totally build for GPUs on a CPU-only instance).
See xformers/.github/workflows/win-build.yml, lines 29 to 50 in affe4da.
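As an illustration of the "install NVCC only if it's not already available" idea, here is a minimal Python sketch. It is not the contents of win-build.yml. The helper names `find_nvcc` and `cuda_build_env` are hypothetical, the lookup relies on the conventional `CUDA_HOME` / `CUDA_PATH` variables, and the use of `FORCE_CUDA` is an assumption that should be checked against the project's setup.py. `TORCH_CUDA_ARCH_LIST` is set explicitly because a CPU-only runner has no GPU from which the target architectures could be detected.

```python
import os
import shutil
from typing import Dict, Mapping, Optional, Sequence


def find_nvcc(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a path to nvcc, or None if no CUDA toolkit is visible (hypothetical helper)."""
    exe = "nvcc.exe" if os.name == "nt" else "nvcc"
    # CUDA_HOME / CUDA_PATH conventionally point at the toolkit root.
    for var in ("CUDA_HOME", "CUDA_PATH"):
        root = env.get(var)
        if root:
            candidate = os.path.join(root, "bin", exe)
            if os.path.isfile(candidate):
                return candidate
    # Otherwise fall back to searching the PATH given in `env`.
    return shutil.which("nvcc", path=env.get("PATH", ""))


def cuda_build_env(
    archs: Sequence[str], base: Mapping[str, str] = os.environ
) -> Dict[str, str]:
    """Build an environment for compiling CUDA extensions without a GPU present."""
    env = dict(base)
    # List the target compute capabilities explicitly, e.g. ["7.0", "8.0"].
    env["TORCH_CUDA_ARCH_LIST"] = ";".join(archs)
    # Assumption: the project's setup.py honours FORCE_CUDA=1; verify before relying on it.
    env["FORCE_CUDA"] = "1"
    return env


if __name__ == "__main__":
    nvcc = find_nvcc()
    if nvcc is None:
        print("nvcc not found: a CUDA toolkit install step would be needed first")
    else:
        print(f"nvcc found at {nvcc}")
```

A workflow step could run the check first and skip the toolkit installation when it succeeds, then pass the result of `cuda_build_env([...])` to the build command. Keeping the architecture list short also bounds the build time on slower runners.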
@danthe3rd I think I found a solution that works for both of us without having to change much. I have forked the infra repo and made all the changes necessary to build the wheels without the self-hosted runners. The idea is: when I am developing I reference my fork, but here we reference the main infra repo. I believe that we won't need the timeout increase in pytorch/test-infra#1267, because the self-hosted runners are that good. You can see all the changes here: https://github.com/AbdBarho/test-infra/pull/1/files

@danthe3rd Can you run this workflow, ideally before merging the branch? That would show us whether my theory is correct. You will get errors when building on CPU or ROCm, but CUDA should work.
Opened #580 to test this
I'm not sure we have access to the self-hosted runners from PyTorch, as they are under a different GitHub organization.
All the github actions are stuck with:
It looks like we don't have access to those runners :/
Bummer... Is there a plan to become part of the PyTorch organization or get access to the runners anytime soon? In any case, we could in theory continue using my fork of the infra repo with the standard runners, but I don't see any added value in this approach. If we are going to maintain the CI code anyway, then we should stick with the workflow we already have here and put Nova aside for the time being. What do you think?
@danthe3rd I am going to close this for now.
What does this PR do?
Prototype building wheels with NOVA
Related to #533
NOVA complications
BUILD_ENV_FILE, which seems to be an internal thing.
Current state
The attached workflow has multiple problems that I don't know how to solve:
setup.py cannot detect nvcc. I am unsure whether we need to install the CUDA toolkit separately on Windows, since this was not necessary on Linux (workflow file, error log).
I am open to suggestions.