
Add nightly build workflow #250

Merged: 33 commits into main, May 20, 2024

Conversation

@huydhn (Contributor) commented May 16, 2024

Another attempt to properly use the Nova wheel build workflow, because #249 doesn't seem to build the correct wheel.

This requires pytorch/test-infra#5217 to handle the PyPI upload.

pytorch-bot bot commented May 16, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/250

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 8048bc7 with merge base e9e5fae:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label May 16, 2024

pytorch-bot bot commented May 16, 2024

No ciflow labels are configured for this repo.
For information on how to enable the CIFlow bot, see this wiki.

@huydhn marked this pull request as ready for review May 17, 2024 00:56
@huydhn requested a review from msaroufim May 17, 2024 00:56
@msaroufim (Member) commented:

Thanks Huy! Am I reading this correctly that this is still not a manylinux binary? https://github.com/pytorch/ao/actions/runs/9122140883/job/25082459762?pr=250#step:14:538

Also, re the upload: where will this upload to? It seems like the PyTorch S3 bucket (s3://pytorch/whl/nightly/cu124/torchao_nightly-2024.5.17-cp310-cp310-linux_x86_64.whl). Does this get copied by a cron job to https://download.pytorch.org/whl/nightly/torchao/?

@huydhn (Contributor, Author) commented May 17, 2024

> Thanks Huy! Am I reading this correctly that this is still not a manylinux binary? https://github.com/pytorch/ao/actions/runs/9122140883/job/25082459762?pr=250#step:14:538

I'm asking myself the same question. I think the workflow does its job as expected, because I can see PyTorch nightly binaries in the same format: https://download.pytorch.org/whl/nightly/torch/ or https://download.pytorch.org/whl/nightly/torchvision. PyTorch wheels go through this script before being uploaded to PyPI: https://github.com/pytorch/builder/blob/main/release/promote/wheel_to_pypi.sh#L45-L59. I'm pretty sure that this is where the file is prepped and renamed so that it becomes manylinux1 on PyPI, i.e. https://pypi.org/project/torch/2.3.0/#files
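Roughly, the idea there looks like this (a hedged sketch, not the script's exact commands; the filename is just the example wheel from above):

# Simplified sketch: rewrite the platform tag in the wheel's metadata and
# filename so PyPI accepts it as manylinux1 (illustrative only).
whl=torchao_nightly-2024.5.17-cp310-cp310-linux_x86_64.whl

python -m wheel unpack "$whl" --dest tmp_wheel
find tmp_wheel -name WHEEL -exec sed -i 's/linux_x86_64/manylinux1_x86_64/g' {} +
python -m wheel pack tmp_wheel/*/        # repacks as ...-manylinux1_x86_64.whl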

> Also, re the upload: where will this upload to? It seems like the PyTorch S3 bucket (s3://pytorch/whl/nightly/cu124/torchao_nightly-2024.5.17-cp310-cp310-linux_x86_64.whl). Does this get copied by a cron job to https://download.pytorch.org/whl/nightly/torchao/?

Yeah, the file will be uploaded to https://download.pytorch.org/whl/nightly under a per-CUDA-arch subdirectory, for example https://download.pytorch.org/whl/nightly/cu121. The installation instruction will be `pip install torchao --index-url https://download.pytorch.org/whl/nightly/cu121` or `pip install torchao --index-url https://download.pytorch.org/whl/nightly` (the latter defaults to cu121 IIRC).
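For reference, the resulting install commands would look like this (assuming the nightly indexes are populated as described):

# Install the torchao nightly built against a specific CUDA version:
pip install torchao --index-url https://download.pytorch.org/whl/nightly/cu121

# Or use the top-level nightly index, which defaults to cu121:
pip install torchao --index-url https://download.pytorch.org/whl/nightly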

@huydhn (Contributor, Author) commented May 17, 2024

About the script https://github.com/pytorch/builder/blob/main/release/promote/wheel_to_pypi.sh#L45-L59, I will check with @atalman and follow up on this tomorrow; we have several choices:

@huydhn (Contributor, Author) commented May 17, 2024

@msaroufim Could you help confirm that the nightly binaries are correct? https://pypi.org/project/torchao-nightly/2024.5.17/#files

@msaroufim (Member) left a comment:

Awesome!

The main, more general questions I have are:

It seems like this script already also does official releases, by switching this variable here to anything except a value containing the word nightly:

     ref:
        description: Reference to checkout
        default: nightly

And the second question: I just did a manual validation of the nightly binaries by running

pip install torchao-nightly==2024.5.17
python
>>> import torch
>>> import torchao.ops
>>> dir(torch.ops)

And then

pytest test/test_ops.py .......      

So it seems like I did successfully install and run CUDA binaries that were compiled on a different machine than mine, which means the CUDA manylinux support is working, and that's awesome! Does this give us enough confidence that the manylinux support will continue working, or should we also have binary validation where the test machines are different from the build machines?

packaging/post_build_script.sh (resolved review thread)

pip install setuptools wheel twine auditwheel

# NB: This will make the nightly wheel compatible with the latest stable torch,
Member:

Yeah, I think the correct release strategy might be:

  1. Nightly ao with nightly torch
  2. Release ao with release torch

So at least to merge this, let's use the nightly torch versions.

Contributor Author:

Then the fix is very simple: we just need to remove the step that installs torch here. The build workflow has already installed the correct torch version, i.e. nightly/release torch for nightly/release ao respectively.

packaging/smoke_test.py (resolved review thread)
@huydhn (Contributor, Author) commented May 18, 2024

> It seems like this script already also does official releases, by switching this variable here to anything except a value containing the word nightly

Yes, that's the benefit of using the Nova build workflow: nightly and release builds are both supported by the same workflows. When it comes to a release, we could follow the same practice as the PyTorch release process (https://github.com/pytorch/pytorch/blob/main/RELEASE.md) of doing a branch cut and building the release candidates.
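For illustration, a branch cut would look roughly like this (the branch name is hypothetical, borrowing the PyTorch release/X.Y convention; torchao may choose a different scheme):

# Cut a release branch off main (hypothetical branch name):
git checkout -b release/0.2 main
git push origin release/0.2

# The build workflow's `ref` input (default: nightly) would then point at
# release/0.2 to produce release-candidate wheels from that branch.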

> So it seems like I did successfully install and run CUDA binaries that were compiled on a different machine than mine, which means the CUDA manylinux support is working, and that's awesome! Does this give us enough confidence that the manylinux support will continue working, or should we also have binary validation where the test machines are different from the build machines?

You are spot on. We do have 2 layers of validation for release. What you see here is the first layer, where we have the post-build script and the Python smoke test to perform any validations we want on the same runner. If this fails, the binaries won't even be published.

The second validation is indeed performed on a separate set of ephemeral "clean" runners. This is done from https://github.com/pytorch/builder/actions/workflows/validate-binaries.yml, and we run it manually when building new release candidates for PyTorch and several other domain libraries (vision, audio, text). This can come later. We also want to answer the question of whether torchao follows the same release cadence as PyTorch or is released independently. The former means that we can hook torchao into the same validate-binaries workflow, while the latter means that a new validate-binaries workflow is needed in the torchao repo itself.
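For illustration, that second layer boils down to something like this on a clean machine (a sketch only; the real steps are defined in the validate-binaries workflow):

# Sketch of a clean-runner validation (illustrative):
python -m venv clean-env && source clean-env/bin/activate

# Install only the published nightly wheels, never locally built artifacts:
pip install torch torchao --index-url https://download.pytorch.org/whl/nightly/cu121

# Import and exercise the package on a machine that never saw the build,
# plus running the test suite (e.g. test/test_ops.py) from a source checkout:
python -c "import torch, torchao; print(torch.__version__)"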

@seemethere (Member) left a comment:

I think it generally looks good, aside from a matrix question for the upload-to-PyPI bits.

Comment on lines 19 to 20
test-infra-repository: pytorch/test-infra
test-infra-ref: main
Member:

Suggested change: remove these lines.

These are set by default so no need to specify them explicitly

Comment on lines 30 to 52
strategy:
  fail-fast: false
  matrix:
    include:
      - repository: pytorch/ao
        pre-script: packaging/pre_build_script.sh
        post-script: packaging/post_build_script.sh
        smoke-test-script: packaging/smoke_test.py
        package-name: torchao
name: ${{ matrix.repository }}
uses: pytorch/test-infra/.github/workflows/build_wheels_linux.yml@main
with:
  repository: ${{ matrix.repository }}
  ref: ""
  test-infra-repository: pytorch/test-infra
  test-infra-ref: main
  build-matrix: ${{ needs.generate-matrix.outputs.matrix }}
  env-var-script: packaging/env_var_script_linux.sh
  pre-script: ${{ matrix.pre-script }}
  post-script: ${{ matrix.post-script }}
  package-name: ${{ matrix.package-name }}
  smoke-test-script: ${{ matrix.smoke-test-script }}
  trigger-event: ${{ github.event_name }}
Member:

Suggested change (replace the block above with):

name: build-wheel
uses: pytorch/test-infra/.github/workflows/build_wheels_linux.yml@main
with:
  build-matrix: ${{ needs.generate-matrix.outputs.matrix }}
  pre-script: packaging/pre_build_script.sh
  post-script: packaging/post_build_script.sh
  package-name: torchao
  smoke-test-script: packaging/smoke_test.py
  trigger-event: ${{ github.event_name }}

This can be simplified to utilize the proper defaults; also, a matrix is unnecessary here since you're only doing this for one repository.

Comment on lines 62 to 65
repository: pytorch/ao
ref: ""
test-infra-repository: pytorch/test-infra
test-infra-ref: main
Member:

Suggested change: remove these lines.

These are already set by default so you don't actually need them

@@ -0,0 +1,78 @@
name: upload-pypi
Member:

This actually might be good to break out into pytorch/test-infra eventually; it seems generally useful.

Contributor Author:

Yeah, let me do that. Moving this to test-infra would actually address your other comments about the complexity of the input matrix. I only need the matrix to feed into ./test-infra/.github/actions/setup-binary-upload, the one that sets env.ARTIFACT_NAME. For context, this setup allows the upload to proceed even when some build configs fail.

Contributor Author:

# TODO (huydhn): This is a work around to upload only CUDA 12.1 to pypi.
# Other binaries for different CUDA versions will be available on
# download.pytorch.org
if: ${{ inputs.trigger-event == 'schedule' && matrix.desired_cuda == 'cu121' }}
Member:

Wait, so does this mean that we spawn a bunch of jobs here to upload to PyPI, but only actually do the upload for cu121?

Why not just make it so it only does the upload for the specific one we want instead of spawning a bunch of jobs that are basically no-ops?

Contributor Author:

Well, I tried to skip those no-op uploads, but thought that checking for cu121 was a hack anyway. So, I plan to make a change further upstream in pytorch/test-infra/.github/workflows/generate_binary_build_matrix.yml@main later to limit the CUDA versions there instead.

Anyway, I think those redundant jobs will be dealt with after I update the test-infra upload step to handle the PyPI upload. A small refactor is on the way.

Member:

So I do want us to mimic what torch does. For example, our official releases would use CUDA 11.8 and 12.1. I'm not sure if the policy for CUDA versions differs between nightlies and official releases.

Contributor Author:

In PyTorch, both nightly and official releases can be installed from https://download.pytorch.org for all CUDA versions. We then upload only one blessed CUDA version to PyPI, which is 12.1 atm (https://pypi.org/project/torch/2.3.0/#files). So, what we have here indeed mimics what PyTorch does.
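For example, this is the pattern on the torch side that torchao would be mimicking:

# PyPI carries only the blessed CUDA build (cu121 for torch 2.3.0), so a
# plain install picks that variant:
pip install torch

# download.pytorch.org carries every published CUDA variant; pick one
# explicitly via its index, e.g. CUDA 11.8:
pip install torch --index-url https://download.pytorch.org/whl/cu118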

with:
  package-type: wheel
  os: linux
  with-cpu: disable
Member:

We would still want to upload CPU-only binaries.

huydhn added a commit to pytorch/test-infra that referenced this pull request May 20, 2024
@msaroufim self-requested a review May 20, 2024 17:26
@msaroufim (Member) left a comment:

Amazing work, thank you!

@huydhn (Contributor, Author) commented May 20, 2024

@pytorchbot drci

@huydhn merged commit e0affd6 into main May 20, 2024
53 of 54 checks passed
@msaroufim mentioned this pull request May 20, 2024
huydhn added a commit to pytorch/test-infra that referenced this pull request May 21, 2024
A follow-up after #5217 and pytorch/ao#250: repos like `torch/ao` don't have a nightly branch and instead build nightlies directly out of main on a daily schedule.
huydhn added a commit to pytorch/test-infra that referenced this pull request May 22, 2024
A follow-up after #5217 and pytorch/ao#250: repos like torch/ao don't have a nightly branch and instead build nightlies directly out of main on a daily schedule. This was missed in #5222. It's a bit tricky to get this right off the bat.

Also need to land D57657633 to allow nightly upload on the AWS side.
dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request Jul 31, 2024
* Add nightly build workflow

* Fix wrong env script path

* Install pip deps

* Try to install gxx_linux-64 and gcc_linux-64

* Another attempt

* Set TORCH_CUDA_ARCH_LIST

* Does this work?

* Attempt to upload to pypi

* Pass the arch

* Try pypa/gh-action-pypi-publish@release/v1

* Wrong workflow syntax

* Try auditwheel

* Repair wheel manylinux2014_x86_64

* Debug

* Debug

* Almost there

* Debug

* Almost there

* Pass -y to uninstall

* Final testing

* Upload other python versions

* Remove debug

* Remove build.yml

* Add more validations

* Run all unit tests

* Fix typo

* Run only test_ops

* Passing the secrets

* Move pypi upload to test-infra

* Enable CPU build

* Switch back to main after landing pytorch/test-infra#5217

* Remove unrelated copy/paste comments
Labels: ciflow/binaries/all, CLA Signed