
Elastic Self-Hosted Agent Pools #395

Closed
RichiCoder1 opened this issue Mar 30, 2020 · 14 comments

@RichiCoder1

Sorry if this isn't the right repo!

Describe the enhancement
Support for elastic self-hosted pools. The biggest downside of using self-hosted over the GitHub-run agents is the lack of scalability (and the loss of a clean state for each run).

It'd be great if you could, through some mechanism, allow the ability to scale up and down self-hosted agents.

Azure DevOps Pipelines is currently exploring a way of doing this, though their method takes advantage of having full access to a customer's Azure subscription, where they can do the heavy lifting of setting up scale sets and automating them.

Another nice option, one that would likely require a heavy lift, would be native support for Kubernetes. Possibly the biggest issue there would be making Actions' native container support play well, but this would allow actions to take advantage of Kubernetes' native scaling support via things like HPA/VPA and Cluster Autoscaling. It could, in theory, also be done generically so you wouldn't have to build out a provider-specific solution. Prior art here includes GitLab and Jenkins.

RichiCoder1 added the enhancement label Mar 30, 2020
@TingluoHuang
Member

@chrispat @thejoebourneidentity from Product for this request

@acmcelwee

acmcelwee commented Apr 2, 2020

We're currently heavy CircleCI users, but I've started a pilot to augment one of our repos to run certain CI-related tasks in a combination of GH-hosted runners and self-hosted runners (for the integration tests that need private network access to other apps in our VPC).

I've spent a decent amount of time working on this over the past two weeks for this project. At the top of the list of our requirements, we have:

  • Ideally, containerized. Personally, I'd rather avoid a VM-centric approach, which would likely have higher scale-out latency when bursting for common workload spikes.
  • Clean-slate for each build
  • Can run in our shared, multi-tenant container orchestration environment (currently running in ECS, but it would be nice to see a k8s-native solution as well). The motivation here is to use the spare compute that we have there, versus spinning up more underutilized infrastructure with bursty usage.
  • Doesn't require any manual registration/cleanup for the runner

Achieving anything remotely resembling this has required a lot of moving parts, so I'll share my journey here for anyone who finds it useful. I started with some of the nuggets in this 4-part blog series, specifically:

  • automating the config of an ephemeral runner with the undocumented flags in the config.sh script
  • automating the removal of the runner
  • the also undocumented ./run.sh --once functionality

Armed with those new discoveries, I started building out a Debian-based Docker image that includes all of the necessary GH Actions packages (installed from the latest GH repo release) and the setup/run/teardown scripts I've written, wired together using the rock-solid s6-overlay as the container lifecycle management piece. With all of that, I now have clean-slate containers that KMS-decrypt the GH Personal Access Token (PAT) supplied via an env var, register with the GH repo using the PAT, start a run.sh --once that will run forever until it gets a job assigned, and perform the cleanup action to remove themselves from the repo once the job is no longer running.
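For anyone trying to replicate that flow, the entrypoint boils down to something like the sketch below. The variable names, the KMS wrapping, and the exact flag combination are my own choices, and --once/--unattended were undocumented at the time, so treat this as illustrative rather than canonical:

#!/usr/bin/env bash
# Sketch of an ephemeral-runner entrypoint (names and paths are illustrative).
set -euo pipefail

# The PAT arrives KMS-encrypted in an env var; decrypt it first.
echo "${ENCRYPTED_PAT}" | base64 -d > /tmp/pat.enc
GH_PAT="$(aws kms decrypt --ciphertext-blob fileb:///tmp/pat.enc \
          --output text --query Plaintext | base64 -d)"

REPO_URL="https://github.com/${GH_OWNER}/${GH_REPO}"
API_URL="https://api.github.com/repos/${GH_OWNER}/${GH_REPO}/actions/runners"

# Exchange the PAT for a short-lived registration token.
REG_TOKEN="$(curl -s -X POST -H "Authorization: token ${GH_PAT}" \
  "${API_URL}/registration-token" | jq -r .token)"

cleanup() {
  # Remove the runner from the repo when the container stops.
  REMOVE_TOKEN="$(curl -s -X POST -H "Authorization: token ${GH_PAT}" \
    "${API_URL}/remove-token" | jq -r .token)"
  ./config.sh remove --token "${REMOVE_TOKEN}"
}
trap cleanup EXIT

# Register without prompts, then run exactly one job and exit.
./config.sh --unattended --url "${REPO_URL}" --token "${REG_TOKEN}" --name "$(hostname)"
./run.sh --once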

To make this a user-friendly system, I also needed a mechanism for scaling the number of running containers that distinguishes between a container that's ready to accept a job and a container that's already running a job (and that will be shut down as soon as the job is complete). The best indicator I was able to find in the docs was a couple of magic values in the runner output, as documented here. In the container, I added a simple node process that examines the output of the runner and publishes two metrics to CloudWatch:

  • Whether the runner has fully initialized
  • Whether the runner is currently running a job
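Conceptually the watcher is just something like this (I used node, but a shell sketch shows the idea; the log strings and metric names here are my own guesses and may differ between runner versions):

#!/usr/bin/env bash
# Watch the runner output and publish readiness/busy signals to CloudWatch.
# The matched strings and metric names are illustrative, not authoritative.
./run.sh --once 2>&1 | while IFS= read -r line; do
  echo "$line"
  case "$line" in
    *"Listening for Jobs"*)
      aws cloudwatch put-metric-data --namespace "GitHubRunners" \
        --metric-name RunnerReady --value 1 ;;
    *"Running job:"*)
      aws cloudwatch put-metric-data --namespace "GitHubRunners" \
        --metric-name RunnerBusy --value 1 ;;
  esac
done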

I just finished setting up the ECS service autoscaling rules to scale up the number of running containers based on those CloudWatch metrics. With scale-up handled, that brings me to one of the final issues to work through: gracefully scaling back in...

With ECS, I can't selectively scale back in by terminating specific tasks. I can only reduce the desired count, let the ECS engine pick a task to stop, and adjust the configs to allow a very delayed shutdown process. I was able to configure the ECS container/task config to allow a long delay (multiple minutes) between the SIGTERM and when it will ultimately trigger a SIGKILL. I also had to catch INT/TERM signals in a couple places and use some of the lifecycle pieces of the s6 overlay to give the running job a reasonable amount of time to finish before the processes get a final KILL signal.
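The signal-handling side of that ends up looking roughly like the following (a simplified sketch; in reality s6-overlay owns the process supervision, and the grace period comes from the ECS stop timeout):

#!/usr/bin/env bash
# Wrapper that lets an in-flight job finish before the runner exits.
./run.sh --once &
RUNNER_PID=$!

drain() {
  # On SIGTERM from ECS, keep waiting (up to the container stop timeout)
  # for the current job to finish instead of killing the runner immediately.
  echo "TERM received, draining runner ${RUNNER_PID}..."
  wait "${RUNNER_PID}"
}
trap drain TERM INT

wait "${RUNNER_PID}"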

Another nice-to-have would be an API on the GH side to report the current build queue depth. For now, I'm just scaling up by two for each minute that we have fewer than 3 available runners. If a build queue piles up quickly, it would take a few minutes to react, and our team is used to the near-instant start times with CircleCI.

All in all, I've found a lot to like about GH Actions, and I'm looking forward to continued improvement so we can confidently continue converting our workloads to run 100% as Actions.

@davidkarlsen

Related: #454

@thomwiggers

GitLab CI also supports autoscaling via its docker+machine executor, which can automatically provision, start, and destroy AWS, DO, GCE, ... boxes. It'd be great to be able to do something similar.

@halberom

Related: #493

@simonbyrne

I would also be very interested in something along these lines, especially if it were possible to customize the backend runner. Our particular use case is hooking it up to an HPC cluster scheduler such as Slurm: ideally we would have a single daemon running on a login node, and each action would be scheduled as a job on the cluster (without blocking additional actions from being scheduled).

The US Department of Energy is building something similar with GitLab CI: https://ecp-ci.gitlab.io/
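The current runner can't dispatch a picked-up job to a scheduler, so the closest approximation today is probably running the whole ephemeral runner inside a Slurm job, something like this rough sketch (partition name and script path are made up):

#!/usr/bin/env bash
# Rough sketch: a login-node loop that keeps feeding single-use runners to Slurm.
# register-and-run-once.sh is a hypothetical script that registers an ephemeral
# runner, executes exactly one Actions job (./run.sh --once), and unregisters.
while true; do
  sbatch --partition=ci --wrap "/opt/actions-runner/register-and-run-once.sh"
  sleep 60
done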

@davidkarlsen

#454 is fixed - and so is evryfs/github-actions-runner-operator#2 - which implements elastic build pools.

@renaudhager

Hi guys,

If I understand how GitHub Actions works, when there is no runner registered for a repo, any job fails. Jobs are only queued/pending when there are runners that are busy/offline.

This means there is no way for a system to detect that a job needs a runner, which makes on-demand runners almost impossible unless there is always a configured runner for each repo, and that does not scale very well as the number of repos increases.

It seems to be a design choice to fail all the jobs when there is no worker. A workaround could be to register one "dead" worker for each repo... but it seems that the registration process is not clearly defined in the documentation (https://developer.github.com/v3/actions/self-hosted-runners/), or I did not find it.

Can you share the API calls required to register a runner?

@davidkarlsen

davidkarlsen commented Nov 21, 2020

@renaudhager you'll find the API calls here: https://docs.github.com/en/free-pro-team@latest/rest/reference/actions#self-hosted-runners - there needs to be a match on the labels in order for the job to be queued; if there are no runners with matching labels the job fails.

@renaudhager

@davidkarlsen,

Thank you.
I found and read this page, but it does not describe what call needs to be made to register a runner.
It tells you how to create a registration token:

 Create a registration token for a repository

Returns a token that you can pass to the config script. The token expires after one hour. You must authenticate using an access token with the repo scope to use this endpoint.

and what API calls need to be made to delete a self-hosted runner, but the actual call made by the config script is not described, or I did not find it.
What I would like is to register a runner without using the config script and the binary called by this script.

Also, is there any specific reason to fail the job when there is no runner that matches the labels? It could be queued and wait.
When there are offline runners it does that, so why fail the job when there are none? It makes any on-demand solution more complicated (hence the need to register a runner without using all the boilerplate from the runner archive).

@davidkarlsen

@renaudhager You can have a look at https://github.com/evryfs/github-actions-runner/blob/master/entrypoint.sh#L17, which is a script that:

  1. acquires the reg. token
  2. registers the runner using the token

Why GitHub has chosen to design it this way I do not know - I am merely a humble user...
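Condensed, those two steps amount to something like this (owner, repo, and the PAT env var are placeholders; the PAT needs the repo scope):

#!/usr/bin/env bash
# 1. Acquire a short-lived registration token via the REST API.
REG_TOKEN="$(curl -s -X POST -H "Authorization: token ${GITHUB_PAT}" \
  "https://api.github.com/repos/${OWNER}/${REPO}/actions/runners/registration-token" \
  | jq -r .token)"

# 2. Register the runner with that token. Note that config.sh only forwards
#    these arguments; the actual registration call against GitHub is made by
#    the Runner.Listener binary.
./config.sh --unattended \
  --url "https://github.com/${OWNER}/${REPO}" \
  --token "${REG_TOKEN}"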

@renaudhager

Thanks for your help @davidkarlsen.
I looked at entrypoint.sh; it uses the registration token with config.sh.
But config.sh is just a wrapper around a binary, so I can't see what the exact API call made by the binary is.
Here is an example:

github@github-runner-sre-docker-image-packer-1e2ce01f:~$ ls -l
total 56
drwxr-xr-x 3 github github  4096 Dec  3 00:14 _diag
drwxr-xr-x 7 root   root    4096 Dec  3 00:14 _work
drwxr-xr-x 3 github github 16384 Nov 16 13:37 bin
-rwxr-xr-x 1 github github  2452 Nov 16 13:35 config.sh
-rwxr--r-- 1 github github  1207 Nov 20 14:06 entrypoint.sh
-rwxr-xr-x 1 github github   623 Nov 16 13:35 env.sh
drwxr-xr-x 4 github github  4096 Nov 16 13:36 externals
-rwxr-xr-x 1 github github  1666 Nov 16 13:35 run.sh
-rwxr--r-- 1 github github   513 Nov 20 14:05 runsvc.sh
-rwxr-xr-x 1 github github  4727 Dec  2 23:16 svc.sh
github@github-runner-sre-docker-image-packer-1e2ce01f:~$ ls -l ./bin/Runner.Listener
-rwxr-xr-x 1 github github 90712 Nov 16 13:36 ./bin/Runner.Listener
github@github-runner-sre-docker-image-packer-1e2ce01f:~$ file ./bin/Runner.Listener
./bin/Runner.Listener: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=6644aee648567ad2d7c248a017cbc94d6f5dda3c, stripped
github@github-runner-sre-docker-image-packer-1e2ce01f:~$

github@github-runner-sre-docker-image-packer-1e2ce01f:~$ cat config.sh  | grep -i token
github@github-runner-sre-docker-image-packer-1e2ce01f:~$ tail -n 20 config.sh

# Change directory to the script root directory
# https://stackoverflow.com/questions/59895/getting-the-source-directory-of-a-bash-script-from-within
SOURCE="${BASH_SOURCE[0]}"
while [ -h "$SOURCE" ]; do # resolve $SOURCE until the file is no longer a symlink
  DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
  SOURCE="$(readlink "$SOURCE")"
  [[ $SOURCE != /* ]] && SOURCE="$DIR/$SOURCE" # if $SOURCE was a relative symlink, we need to resolve it relative to the path where the symlink file was located
done
DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
cd "$DIR"

source ./env.sh

shopt -s nocasematch
if [[ "$1" == "remove" ]]; then
    ./bin/Runner.Listener "$@"
else
    ./bin/Runner.Listener configure "$@"
fi
github@github-runner-sre-docker-image-packer-1e2ce01f:~$

As we can see, config.sh does not make any API call itself; it just passes all args to ./bin/Runner.Listener.

That is why I asked: I can't find anything related to the registration API call :-(

@tomkerkhove

For those who are interested, we are exploring options to support autoscaling with KEDA.

Please head over to kedacore/keda#1732 and give a 👍.

@ethomson
Contributor

ethomson commented Mar 1, 2022

It'd be great if you could, through some mechanism, allow the ability to scale up and down self-hosted agents.

Yes! We're very interested in this -- and please don't take me closing this issue as anything but. 😁

This issue is a bit old, so I wanted to give my current thoughts:

The runner itself (meaning: the application in this repository) will not do any autoscaling; instead, compute will be scaled up and the runner (again, meaning the application in this repository) will take a job and run it. We've made some changes to the runner to support this (like the --ephemeral flag). We will likely make more changes as well, but at the core of the problem is an autoscaling application, and the runner is not that application.
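To sketch what that split looks like in practice (OWNER/REPO and the token here are placeholders), each freshly provisioned unit of compute just does:

# Register an ephemeral runner, take exactly one job, then disappear.
./config.sh --unattended --ephemeral \
  --url "https://github.com/OWNER/REPO" \
  --token "${REG_TOKEN}"
./run.sh
# Once the single job finishes, the runner is automatically unregistered,
# so whatever scaled the compute up can simply tear it back down.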

I'm a big fan of Actions Runner Controller. We have people on the team who are contributing to it. If you want to scale containers in Kubernetes, this is what I'd recommend.

There are definitely improvements to be made all over - and we're making them - but I think this is not really a problem that is scoped to this repository; it's bigger than that. So I'm going to close this so that it's not being tracked on the runner team's issue board (in other words, this repository). If you have more feedback, the Actions feedback repository is the best place to discuss further. Thanks!

ethomson closed this as completed Mar 1, 2022