
Add nightly build workflow #250

Merged: 33 commits into main, May 20, 2024

Conversation

@huydhn (Contributor) commented May 16, 2024

Another attempt to properly use the Nova wheel build workflow, because #249 doesn't seem to build the correct wheel.

This requires pytorch/test-infra#5217 to handle the PyPI upload.

pytorch-bot bot commented May 16, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/250

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 8048bc7 with merge base e9e5fae:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label May 16, 2024

pytorch-bot bot commented May 16, 2024

No ciflow labels are configured for this repo.
For information on how to enable the CIFlow bot, see this wiki.

@huydhn marked this pull request as ready for review May 17, 2024 00:56
@huydhn requested a review from msaroufim May 17, 2024 00:56
@msaroufim (Member) commented:

Thanks Huy! Am I reading this correctly that this is still not a manylinux binary? https://github.com/pytorch/ao/actions/runs/9122140883/job/25082459762?pr=250#step:14:538

Also, re the upload: where will this upload to? It seems like the PyTorch S3 bucket (s3://pytorch/whl/nightly/cu124/torchao_nightly-2024.5.17-cp310-cp310-linux_x86_64.whl). Does this get copied by a cron job to https://download.pytorch.org/whl/nightly/torchao/?

@huydhn (Contributor, Author) commented May 17, 2024

> Thanks Huy! Am I reading this correctly that this is still not a manylinux binary? https://github.com/pytorch/ao/actions/runs/9122140883/job/25082459762?pr=250#step:14:538

I'm asking myself the same question. I think the workflow does its job as expected, because I can see PyTorch nightly binaries in the same format: https://download.pytorch.org/whl/nightly/torch/ or https://download.pytorch.org/whl/nightly/torchvision. PyTorch wheels go through this script before being uploaded to PyPI: https://github.com/pytorch/builder/blob/main/release/promote/wheel_to_pypi.sh#L45-L59. I'm pretty sure that this is where the file is prepped and renamed so that it becomes manylinux1 on PyPI, i.e. https://pypi.org/project/torch/2.3.0/#files
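Roughly, the idea there looks like this (a hedged sketch, not the script's exact commands; the filename is just the example wheel from above):

# Simplified sketch: rewrite the platform tag in the wheel's metadata and
# filename so PyPI accepts it as manylinux1 (illustrative only).
whl=torchao_nightly-2024.5.17-cp310-cp310-linux_x86_64.whl

python -m wheel unpack "$whl" --dest tmp_wheel
find tmp_wheel -name WHEEL -exec sed -i 's/linux_x86_64/manylinux1_x86_64/g' {} +
python -m wheel pack tmp_wheel/*/        # repacks as ...-manylinux1_x86_64.whl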

> Also, re the upload: where will this upload to? It seems like the PyTorch S3 bucket (s3://pytorch/whl/nightly/cu124/torchao_nightly-2024.5.17-cp310-cp310-linux_x86_64.whl). Does this get copied by a cron job to https://download.pytorch.org/whl/nightly/torchao/?

Yeah, the file will be uploaded to https://download.pytorch.org/whl/nightly under a per-CUDA-arch subdirectory, for example https://download.pytorch.org/whl/nightly/cu121. The installation instruction will be `pip install torchao --index-url https://download.pytorch.org/whl/nightly/cu121` or `pip install torchao --index-url https://download.pytorch.org/whl/nightly` (the latter defaults to cu121 IIRC).
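For reference, the resulting install commands would look like this (assuming the nightly indexes are populated as described):

# Install the torchao nightly built against a specific CUDA version:
pip install torchao --index-url https://download.pytorch.org/whl/nightly/cu121

# Or use the top-level nightly index, which defaults to cu121:
pip install torchao --index-url https://download.pytorch.org/whl/nightly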

@huydhn (Contributor, Author) commented May 17, 2024

About the script https://github.com/pytorch/builder/blob/main/release/promote/wheel_to_pypi.sh#L45-L59, I will check with @atalman and follow up on this tomorrow; we have several choices:

@huydhn (Contributor, Author) commented May 17, 2024

@msaroufim Could you help confirm that the nightly binaries are correct? https://pypi.org/project/torchao-nightly/2024.5.17/#files

@msaroufim (Member) left a comment:

Awesome!

The main, more general questions I have are:

It seems like this script already also does official releases, by switching this variable here to anything except a value containing the word nightly:

     ref:
        description: Reference to checkout
        default: nightly

And the second question: I just did a manual validation of the nightly binaries by running

pip install torchao-nightly==2024.5.17
python
>>> import torch
>>> import torchao.ops
>>> dir(torch.ops)

And then

pytest test/test_ops.py .......      

So it seems like I did successfully install and run CUDA binaries that were compiled on a different machine than mine, which means the CUDA manylinux support is working, and that's awesome! Does this give us enough confidence that the manylinux support will continue working, or should we also have binary validation where the test machines are different from the build machines?

packaging/post_build_script.sh (resolved review thread)

pip install setuptools wheel twine auditwheel

# NB: This will make the nightly wheel compatible with the latest stable torch,
Member:

Yeah, I think the correct release strategy might be:

  1. Nightly ao with nightly torch
  2. Release ao with release torch

So at least to merge this, let's use the nightly torch versions.

Contributor Author:

Then the fix is very simple: we just need to remove the step that installs torch here. The build workflow has already installed the correct torch version, i.e. nightly/release torch for nightly/release ao respectively.

packaging/smoke_test.py (resolved review thread)
@huydhn (Contributor, Author) commented May 18, 2024

> It seems like this script already also does official releases, by switching this variable here to anything except a value containing the word nightly

Yes, that's the benefit of using the Nova build workflow: nightly and release builds are both supported by the same workflows. When it comes to a release, we could follow the same practice as the PyTorch release process (https://github.com/pytorch/pytorch/blob/main/RELEASE.md) of doing a branch cut and building the release candidates.
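For illustration, a branch cut would look roughly like this (the branch name is hypothetical, borrowing the PyTorch release/X.Y convention; torchao may choose a different scheme):

# Cut a release branch off main (hypothetical branch name):
git checkout -b release/0.2 main
git push origin release/0.2

# The build workflow's `ref` input (default: nightly) would then point at
# release/0.2 to produce release-candidate wheels from that branch.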

> So it seems like I did successfully install and run CUDA binaries that were compiled on a different machine than mine, which means the CUDA manylinux support is working, and that's awesome! Does this give us enough confidence that the manylinux support will continue working, or should we also have binary validation where the test machines are different from the build machines?

You are spot on. We do have 2 layers of validation for release. What you see here is the first layer, where we have the post-build script and the Python smoke test to perform any validations we want on the same runner. If this fails, the binaries won't even be published.

The second validation is indeed performed on a separate set of ephemeral "clean" runners. This is done from https://github.com/pytorch/builder/actions/workflows/validate-binaries.yml, and we run it manually when building new release candidates for PyTorch and several other domain libraries (vision, audio, text). This can come later. We also want to answer the question of whether torchao follows the same release cadence as PyTorch or is released independently. The former means that we can hook torchao into the same validate-binaries workflow, while the latter means that a new validate-binaries workflow is needed in the torchao repo itself.
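For illustration, that second layer boils down to something like this on a clean machine (a sketch only; the real steps are defined in the validate-binaries workflow):

# Sketch of a clean-runner validation (illustrative):
python -m venv clean-env && source clean-env/bin/activate

# Install only the published nightly wheels, never locally built artifacts:
pip install torch torchao --index-url https://download.pytorch.org/whl/nightly/cu121

# Import and exercise the package on a machine that never saw the build,
# plus running the test suite (e.g. test/test_ops.py) from a source checkout:
python -c "import torch, torchao; print(torch.__version__)"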

@seemethere (Member) left a comment:

I think it generally looks good, aside from a matrix question for the upload-to-PyPI bits.

Comment on lines 19 to 20
test-infra-repository: pytorch/test-infra
test-infra-ref: main
Member:

Suggested change: remove these lines.

These are set by default so no need to specify them explicitly

Comment on lines 30 to 52
strategy:
  fail-fast: false
  matrix:
    include:
      - repository: pytorch/ao
        pre-script: packaging/pre_build_script.sh
        post-script: packaging/post_build_script.sh
        smoke-test-script: packaging/smoke_test.py
        package-name: torchao
name: ${{ matrix.repository }}
uses: pytorch/test-infra/.github/workflows/build_wheels_linux.yml@main
with:
  repository: ${{ matrix.repository }}
  ref: ""
  test-infra-repository: pytorch/test-infra
  test-infra-ref: main
  build-matrix: ${{ needs.generate-matrix.outputs.matrix }}
  env-var-script: packaging/env_var_script_linux.sh
  pre-script: ${{ matrix.pre-script }}
  post-script: ${{ matrix.post-script }}
  package-name: ${{ matrix.package-name }}
  smoke-test-script: ${{ matrix.smoke-test-script }}
  trigger-event: ${{ github.event_name }}
Member:

Suggested change (replace the block above with):

name: build-wheel
uses: pytorch/test-infra/.github/workflows/build_wheels_linux.yml@main
with:
  build-matrix: ${{ needs.generate-matrix.outputs.matrix }}
  pre-script: packaging/pre_build_script.sh
  post-script: packaging/post_build_script.sh
  package-name: torchao
  smoke-test-script: packaging/smoke_test.py
  trigger-event: ${{ github.event_name }}

This can be simplified to utilize the proper defaults; also, a matrix is unnecessary here since you're only doing this for one repository.

Comment on lines 62 to 65
repository: pytorch/ao
ref: ""
test-infra-repository: pytorch/test-infra
test-infra-ref: main
Member:

Suggested change: remove these lines.

These are already set by default so you don't actually need them

@@ -0,0 +1,78 @@
name: upload-pypi
Member:

This actually might be good to break out into pytorch/test-infra eventually; it seems generally useful.

Contributor Author:

Yeah, let me do that. Moving this to test-infra would actually address your other comments about the complexity of the input matrix. I only need the matrix to feed into ./test-infra/.github/actions/setup-binary-upload, the one that sets env.ARTIFACT_NAME. For context, this setup allows the upload to proceed even when some build configs fail.

Contributor Author:

# TODO (huydhn): This is a work around to upload only CUDA 12.1 to pypi.
# Other binaries for different CUDA versions will be available on
# download.pytorch.org
if: ${{ inputs.trigger-event == 'schedule' && matrix.desired_cuda == 'cu121' }}
Member:

Wait, so does this mean that we spawn a bunch of jobs here to upload to PyPI, but only actually do the upload for cu121?

Why not just make it so it only does the upload for the specific one we want instead of spawning a bunch of jobs that are basically no-ops?

Contributor Author:

Well, I tried to skip those no-op uploads, but thought that checking for cu121 was a hack anyway. So, I plan to make a change further upstream in pytorch/test-infra/.github/workflows/generate_binary_build_matrix.yml@main later to limit the CUDA versions there instead.

Anyway, I think those redundant jobs will be dealt with after I update the test-infra upload step to handle the PyPI upload. A small refactor is on the way.

Member:

So I do want us to mimic what torch does. For example, our official releases would use CUDA 11.8 and 12.1. I'm not sure if the policy for CUDA versions differs between nightlies and official releases.

Contributor Author:

In PyTorch, both nightly and official releases can be installed from https://download.pytorch.org for all CUDA versions. We then upload only one blessed CUDA version to PyPI, which is 12.1 atm (https://pypi.org/project/torch/2.3.0/#files). So, what we have here indeed mimics what PyTorch does.
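For example, this is the pattern on the torch side that torchao would be mimicking:

# PyPI carries only the blessed CUDA build (cu121 for torch 2.3.0), so a
# plain install picks that variant:
pip install torch

# download.pytorch.org carries every published CUDA variant; pick one
# explicitly via its index, e.g. CUDA 11.8:
pip install torch --index-url https://download.pytorch.org/whl/cu118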

with:
  package-type: wheel
  os: linux
  with-cpu: disable
Member:

We would still want to upload CPU-only binaries.

huydhn added a commit to pytorch/test-infra that referenced this pull request May 20, 2024
@msaroufim self-requested a review May 20, 2024 17:26
@msaroufim (Member) left a comment:

Amazing work, thank you!

@huydhn (Contributor, Author) commented May 20, 2024

@pytorchbot drci

@huydhn merged commit e0affd6 into main May 20, 2024
53 of 54 checks passed
@msaroufim mentioned this pull request May 20, 2024
huydhn added a commit to pytorch/test-infra that referenced this pull request May 21, 2024
A follow-up after #5217 and pytorch/ao#250: repos like `torch/ao` don't have a nightly branch and instead build nightlies directly out of main on a daily schedule.
huydhn added a commit to pytorch/test-infra that referenced this pull request May 22, 2024
A follow-up after #5217 and pytorch/ao#250: repos like torch/ao don't have a nightly branch and instead build nightlies directly out of main on a daily schedule. This was missed in #5222. It's a bit tricky to get this right off the bat.

Also need to land D57657633 to allow nightly upload on the AWS side.
dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request Jul 31, 2024
* Add nightly build workflow

* Fix wrong env script path

* Install pip deps

* Try to install gxx_linux-64 and gcc_linux-64

* Another attempt

* Set TORCH_CUDA_ARCH_LIST

* Does this work?

* Attempt to upload to pypi

* Pass the arch

* Try pypa/gh-action-pypi-publish@release/v1

* Wrong workflow syntax

* Try auditwheel

* Repair wheel manylinux2014_x86_64

* Debug

* Debug

* Almost there

* Debug

* Almost there

* Pass -y to uninstall

* Final testing

* Upload other python versions

* Remove debug

* Remove build.yml

* Add more validations

* Run all unit tests

* Fix typo

* Run only test_ops

* Passing the secrets

* Move pypi upload to test-infra

* Enable CPU build

* Switch back to main after landing pytorch/test-infra#5217

* Remove unrelated copy/paste comments
Labels: ciflow/binaries/all, CLA Signed