GCP VM instance does not terminate after idle timeout #1386
@hopeai Can you share more of the workflow that you are running, even if it's mostly redacted? I'm going to try to reproduce the behavior you are seeing again.
Sure @dacbd, I'm using the following:
agents:
  queue: "my-buildkite-agent"
steps:
  - block: ":warning: Provision ec2 instance"
  - command: "buildkite-agent pipeline upload .buildkite/cml_launch_runner.yml"
    label: ":pipeline: Pipeline upload"
    if: "build.branch == 'MYBRANCH'"
  - wait
  - command: "buildkite-agent pipeline upload .buildkite/trigger_my_buildkite_pipeline.yaml"
    label: ":pipeline: Pipeline upload"
    if: "build.branch == 'MYBRANCH'"

The two pipelines that the pipeline above uploads are as follows:
agents:
  queue: "my-buildkite-agent"
steps:
  - label: ":gcloud: setup gcp creds"
    command: "./scripts/configure_gcloud.sh"
  - wait
  - label: ":gcloud: launch GCP Snapshot Creation Instance"
    command:
      - "./scripts/cml_launch_runner.sh"
    plugins:
      - docker#v5.3.0:
          image: "iterativeai/cml:latest"
          environment:
            - "BUILDKITE_GITHUB_USERNAME"
            - "BUILDKITE_GITHUB_TOKEN"
            - "BUILDKITE_BRANCH"
            - "BUILDKITE_BUILD_NUMBER"
            - "BUILDKITE_COMMIT"
            - "GOOGLE_APPLICATION_CREDENTIALS_DATA"

agents:
  queue: "my-buildkite-agent"
steps:
  - label: ":github: Trigger Github Actions"
    command:
      - "./scripts/trigger_github_workflow.sh"

The bash scripts used in the pipelines are as follows:
#!/bin/bash
set -eu -o pipefail
echo ":gcloud: Configure gcloud..."
export GOOGLE_APPLICATION_CREDENTIALS_DATA=$BUILDKITE_GS_APPLICATION_CREDENTIALS_JSON
echo "$GOOGLE_APPLICATION_CREDENTIALS_DATA" >~/gcloud-service-account-key.json
gcloud -q auth activate-service-account --key-file ~/gcloud-service-account-key.json
gcloud -q config set project MY-PROJECT
gcloud -q auth configure-docker
gcloud -v
echo "Done!"
#!/bin/bash
set -eu -o pipefail
# Build a unique instance name
CML_RUNNER_NAME="some-name"
# Provision a GCP instance with a T4 GPU (cloud-type m+t4)
cml runner launch \
  --cloud=gcp \
  --cloud-region=us-central1-a \
  --cloud-type=m+t4 \
  --name="$CML_RUNNER_NAME" \
  --labels=gpu-cml \
  --token="$BUILDKITE_GITHUB_TOKEN" \
  --repo=https://github.com/ORG/myrepo.git \
  --cloud-hdd-size=60 \
  --idle-timeout=120
#!/bin/bash
curl -L \
  -X POST \
  -H "Accept: application/vnd.github+json" \
  -H "Authorization: Bearer $BUILDKITE_GITHUB_TOKEN" \
  -H "X-GitHub-Api-Version: 2022-11-28" \
  --url https://api.github.com/repos/intenseye/myrepo/actions/workflows/my_workflow.yml/dispatches \
  -d '{"ref":"'"$BUILDKITE_BRANCH"'"}'

Running these pipelines works as expected: cml starts a GPU server on GCP, the self-hosted runner accepts my GitHub workflow jobs, and the jobs finish successfully. However, the GCP VM is not terminated after the idle timeout. I checked the logs on the self-hosted runner and shared them here. I think this could help you reproduce the behavior.
The total runtime is close to one hour, which makes me think that your credentials are expiring. I believe if you run the following query in GCP's "log explorer" you will NOT see a delete request (because it will be unauthorised from the expired credentials).
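A rough sketch of how the same check could be done from the command line with `gcloud logging read`. The filter is an assumption about how Compute Engine audit-log entries are shaped and is not necessarily the query referred to above; the helper names `build_delete_filter` and `show_delete_requests` are hypothetical, and `some-name` stands in for the real instance name.

```shell
#!/bin/bash
# Build a Cloud Logging filter for Compute Engine audit-log entries that
# record a delete call on one instance. The field choices are assumed,
# not taken from the thread.
build_delete_filter() {
  local instance_name="$1"
  printf 'resource.type="gce_instance" AND protoPayload.methodName:"compute.instances.delete" AND protoPayload.resourceName:"instances/%s"' "$instance_name"
}

# Query the last two days of logs. Needs an authenticated gcloud CLI with
# the right project selected, so it is only defined here, not called.
show_delete_requests() {
  gcloud logging read "$(build_delete_filter "$1")" --freshness=2d --limit=20
}
```

Calling `show_delete_requests some-name` after the idle timeout has passed should print nothing if no delete request ever reached the API, which would be consistent with the expired-credentials theory.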
Can you try authenticating your GitHub Actions workflow via the following:
- name: 'Authenticate to Google Cloud'
  uses: 'google-github-actions/auth@v0'
  with:
    credentials_json: ${{ secrets.GCP_CML_RUNNER_KEY }}
You can also configure Docker in a similar fashion:
- name: Login to GAR
  uses: docker/login-action@v2
  with:
    registry: <location>-docker.pkg.dev
    username: _json_key
    password: ${{ secrets.GAR_JSON_KEY }}
Thanks a lot @dacbd for pointing this out. Let me try this.
Related to issue #834
Facing this issue with cml v0.19.0
I'm starting a GCP instance using the following command inside a Docker container that uses the iterativeai/cml:latest image in a Buildkite pipeline. I also share GOOGLE_APPLICATION_CREDENTIALS_DATA as an environment variable; it starts and registers the self-hosted runner without any problem.
I configured gcloud with:
run cml runner launch with:
However, after the idle timeout passes, the GCP VM doesn't terminate. The output of journalctl --unit cml --no-pager is as follows:
I am not facing any issue with AWS EC2 instances; those terminate after the idle timeout.
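When reading the journal on the runner VM, it may help to narrow the output to lines about errors, authentication, or teardown. The sketch below is one way to do that; the helper name `filter_cml_log` is hypothetical and the pattern list is a guess, not taken from actual cml output.

```shell
#!/bin/bash
# Keep only log lines that mention errors, auth failures, or teardown.
# The patterns are assumptions about what a failed shutdown might log.
filter_cml_log() {
  # "|| true" keeps the exit status zero when nothing matches.
  grep -iE 'error|unauthori[sz]ed|credential|destroy|terminat' || true
}

# On the runner VM (not run here):
#   journalctl --unit cml --no-pager | filter_cml_log
```

An empty result does not prove the shutdown succeeded; it only means none of the guessed patterns appear in the log.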