Add Self-Hosted AWS GPU Runner #100

Merged · 31 commits · Apr 27, 2023
Changes from 27 commits

Commits
c5ebdca  added self hosted GPU runner CI file (mikemhenry, Apr 19, 2023)
361ee51  should switch to micromamba (mikemhenry, Apr 19, 2023)
84c7d05  if this doesn't work we are using micromamba (mikemhenry, Apr 19, 2023)
9b20013  use micromamba (mikemhenry, Apr 20, 2023)
408acdf  needed to give env a name (mikemhenry, Apr 20, 2023)
ed5e806  see if switching the cudatoolkit to 11.7 works (mikemhenry, Apr 20, 2023)
dc0f828  should be able to use nvcc from the ami (mikemhenry, Apr 20, 2023)
4b9c3ee  fix some version pins (mikemhenry, Apr 20, 2023)
c3daced  Remove pins from environment.yml (mikemhenry, Apr 20, 2023)
c947b7c  set HOME (mikemhenry, Apr 20, 2023)
c0cc3a6  forgot how to set envars (mikemhenry, Apr 20, 2023)
9fbc4d9  Add some debugging (mikemhenry, Apr 20, 2023)
c6ef08f  getting some weird activation problems (mikemhenry, Apr 20, 2023)
e5f9d44  Remove debugging output (mikemhenry, Apr 20, 2023)
399a770  see if now that things are working, I can override the pins (mikemhenry, Apr 20, 2023)
48f37c4  see if this works without activating (mikemhenry, Apr 20, 2023)
9a89bb1  Fix the build that doesn't use cuda (mikemhenry, Apr 20, 2023)
752ee78  Accidently kept a GPU package in base env (mikemhenry, Apr 20, 2023)
ec0576f  keep the environment.yml in the root of the repo intact, move custom … (mikemhenry, Apr 20, 2023)
b2e29b1  missed a path (mikemhenry, Apr 20, 2023)
fcb3d6f  Add caching to speed up env creation (mikemhenry, Apr 20, 2023)
622ec72  revert to keep PR as small as possible (mikemhenry, Apr 20, 2023)
000c190  accidently checkouted wrong versions from stale fork (mikemhenry, Apr 20, 2023)
23e9529  forgot to use the cudatoolkit from the ami instead of Jimver/cuda-too… (mikemhenry, Apr 20, 2023)
ef0392a  forgot to set home (mikemhenry, Apr 20, 2023)
8a3fd2c  make sure we init the shell (mikemhenry, Apr 20, 2023)
6902465  missed a reference to the build matrix (mikemhenry, Apr 20, 2023)
d9917fc  set timeout to be 1 hr (mikemhenry, Apr 21, 2023)
9a695d5  make it easier to update the versions (mikemhenry, Apr 25, 2023)
787fc70  had the env in the wrong spot (mikemhenry, Apr 25, 2023)
40b154c  don't run on a schedule and timeout after 25 minutes (mikemhenry, Apr 26, 2023)

126 changes: 126 additions & 0 deletions .github/workflows/self-hosted-gpu-test.yml
@@ -0,0 +1,126 @@
name: self-hosted-gpu-test
on:
  push:
    branches:
      - master
      - feat/add_aws_testing
  workflow_dispatch:
  schedule:
    # weekly tests
    - cron: "0 0 * * SUN"

defaults:
  run:
    shell: bash -l {0}

jobs:
  start-runner:
    name: Start self-hosted EC2 runner
    runs-on: ubuntu-latest
    outputs:
      label: ${{ steps.start-ec2-runner.outputs.label }}
      ec2-instance-id: ${{ steps.start-ec2-runner.outputs.ec2-instance-id }}
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}
      - name: Try to start EC2 runner
        id: start-ec2-runner
        uses: machulav/ec2-github-runner@main
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-04d16a12bbc76ff0b
          ec2-instance-type: g4dn.xlarge
          subnet-id: subnet-0dee8543e12afe0cd # us-east-1a
          security-group-id: sg-0f9809618550edb98
          # iam-role-name: self-hosted-runner # optional, requires additional permissions
          aws-resource-tags: > # optional, requires additional permissions
            [
              {"Key": "Name", "Value": "ec2-github-runner"},
              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
            ]

  do-the-job:
    name: Do the job on the runner
    needs: start-runner # required to start the main job when the runner is ready
    runs-on: ${{ needs.start-runner.outputs.label }} # run the job on the newly created runner
    timeout-minutes: 1200 # 20 hrs
    steps:

      - name: Check out
        uses: actions/checkout@v3

      - name: Install Miniconda
        uses: conda-incubator/setup-miniconda@v2
        env:
          HOME: /home/ec2-user
        with:
          activate-environment: ""
          auto-activate-base: true
          miniforge-variant: Mambaforge

      - name: Prepare dependencies (with CUDA)
        run: |
          sed -i -e "/cudatoolkit/c\ - cudatoolkit 11.7.*" \
                 -e "/gxx_linux-64/c\ - gxx_linux-64 10.3.*" \
                 -e "/torchani/c\ - torchani 2.2*" \
                 -e "/nvcc_linux-64/c\ - nvcc_linux-64 11.7.*" \
                 -e "/python/c\ - python 3.10.*" \
                 -e "/pytorch-gpu/c\ - pytorch-gpu 2.0.*" \
                 environment.yml

      - name: Show dependency file
        run: cat environment.yml

      - name: Install dependencies
        run: |
          mamba env create -n nnpops -f environment.yml
          conda init

      - name: List conda environment
        run: |
          conda activate nnpops
          conda list

      - name: Configure, compile, and install
        run: |
          conda activate nnpops
          mkdir build && cd build
          cmake .. \
                -DENABLE_CUDA=true \
                -DTorch_DIR=$(python -c 'import torch.utils; print(torch.utils.cmake_prefix_path)')/Torch \
                -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX
          make install

      - name: Test
        run: |
          conda activate nnpops
          cd build
          ctest --verbose

  stop-runner:
    name: Stop self-hosted EC2 runner
    needs:
      - start-runner # required to get output from the start-runner job
      - do-the-job # required to wait when the main job is done
    runs-on: ubuntu-latest
    if: ${{ always() }} # required to stop the runner even if the error happened in the previous jobs
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}
      - name: Stop EC2 runner
        uses: machulav/ec2-github-runner@main
        with:
          mode: stop
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          label: ${{ needs.start-runner.outputs.label }}
          ec2-instance-id: ${{ needs.start-runner.outputs.ec2-instance-id }}
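
Note on the "Prepare dependencies (with CUDA)" step: it relies on sed's `c` (change) command, which replaces every line matching a pattern with the given text. The following is a minimal sketch of that behavior, not the workflow's actual step. It assumes GNU sed (as found on typical Linux runners) and uses a made-up dependency file, since the real environment.yml is not shown in this diff; `demo_env` and its contents are hypothetical.

```shell
set -eu

# Hypothetical stand-in for environment.yml (the real file is not part of this diff).
demo_env="$(mktemp)"
cat > "$demo_env" <<'EOF'
name: demo
dependencies:
  - cudatoolkit
  - python
  - pytorch-gpu
EOF

# GNU sed one-line form: /PATTERN/c\TEXT replaces each matching line with TEXT.
# Without -i the result goes to stdout, so the input file is left untouched.
pinned="$(sed -e "/cudatoolkit/c\  - cudatoolkit 11.7.*" \
              -e "/pytorch-gpu/c\  - pytorch-gpu 2.0.*" \
              "$demo_env")"

printf '%s\n' "$pinned"
rm -f "$demo_env"
```

Because `c` rewrites any line that contains the pattern, a broad pattern such as `/python/` would also replace other dependency lines that happen to contain that substring, so it is worth checking the output (as the "Show dependency file" step does) after changing the pins.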