Elastic Self-Hosted Agent Pools #395
@chrispat @thejoebourneidentity from Product for this request |
We're currently heavy CircleCI users, but I've started a pilot to augment one of our repos to run certain CI-related tasks in a combination of GH-hosted runners and self-hosted runners (for the integration tests that need private network access to other apps in our VPC). I've spent a decent amount of time working on this over the past two weeks. At the top of the list of our requirements, we have:
Achieving anything remotely resembling this has required a lot of moving parts, so I'll share my journey here for anyone who finds it useful. I started with some of the nuggets in this 4-part blog series, specifically:
Armed with those new discoveries, I started building out a Debian-based Docker image that includes all of the necessary GH Actions packages (installed from the latest GH repo release) and the setup/run/teardown scripts I've written, wired together using the rock-solid s6-overlay as the container lifecycle management piece. With all of that, I now have clean-slate containers that KMS-decrypt the GH Personal Access Token (PAT) supplied via an env var, register with the GH repo using the PAT, and start the runner.

To make this a user-friendly system, I also needed a mechanism for deciding when to scale the number of running containers, one that distinguishes between a running container that's ready to accept a job and a running container that's already running a job (and that would be shut down as soon as the job is complete). The best indicator I was able to find in the docs was that I have to look for a couple of magic values in the runner output, as documented here. In the container, I added a simple node process that examines the output of the runner and publishes two metrics to CloudWatch: one counting runners that are idle and ready to accept a job, and one counting runners that are currently busy (see the sketch below).
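For anyone reproducing this pattern, here is a minimal sketch of such a watcher in shell rather than node. The namespace, metric names, and the exact "magic" strings are assumptions, not values taken from the comment above:

```bash
#!/usr/bin/env bash
# Minimal sketch: read the runner's console output and publish two gauge
# metrics via the AWS CLI. Metric names/namespace are hypothetical.
set -uo pipefail

publish() {  # publish <ready 0|1> <busy 0|1>
  aws cloudwatch put-metric-data --namespace "GitHubRunners" \
    --metric-data "MetricName=RunnerReady,Value=$1,Unit=Count" \
                  "MetricName=RunnerBusy,Value=$2,Unit=Count"
}

./run.sh 2>&1 | while IFS= read -r line; do
  echo "$line"                              # pass the output through
  case "$line" in
    *"Listening for Jobs"*) publish 1 0 ;;  # idle, ready for work
    *"Running job:"*)       publish 0 1 ;;  # busy with a job
  esac
done
```

An ECS scaling policy can then target the SUM of `RunnerReady` across tasks to decide when to add capacity.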
I just finished setting up the ECS service autoscaling rules to scale up the number of running containers based on those CloudWatch metrics. With scale-out working, that brings me to one of the final issues to work through: gracefully scaling back in. With ECS, I can't selectively scale in by terminating specific tasks; I can only reduce the desired count, let the ECS engine pick a task to stop, and adjust the configs to allow a very delayed shutdown. I was able to configure the ECS container/task config to allow a long delay (multiple minutes) between the SIGTERM and when ECS will ultimately send a SIGKILL. I also had to catch INT/TERM signals in a couple of places and use some of the lifecycle pieces of the s6-overlay to give a running job a reasonable amount of time to finish before the processes get the final KILL signal.

Another nice-to-have would be an API on the GH side to report the current build queue depth. For now, I'm just scaling up by two for each minute that we have fewer than 3 available runners. If a build queue piles up quickly, it would take a few minutes to react, and our team is used to the near-instant start times of CircleCI. All in all, I've found a lot to like about GH Actions, and I'm looking forward to continued improvement so we can confidently continue converting our workloads to run 100% as Actions. |
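A condensed sketch of that drain-on-SIGTERM idea, minus the s6-overlay specifics. ECS sends SIGTERM first and SIGKILL only after the task's `stopTimeout`, so that window can be used to let an in-flight job finish; `Runner.Worker` is the process the runner spawns per job, and the paths and timing here are assumptions:

```bash
#!/usr/bin/env bash
# Catch SIGTERM from ECS and wait for any in-flight job before exiting.
set -uo pipefail

./run.sh &                 # the stock Actions runner launcher
RUNNER_PID=$!

drain() {
  while pgrep -f Runner.Worker > /dev/null; do
    sleep 5                # a worker process exists only while a job runs
  done
  kill -TERM "$RUNNER_PID" 2>/dev/null
}
trap drain TERM INT

wait "$RUNNER_PID"
wait "$RUNNER_PID"         # re-wait: the first wait returns when the trap fires
```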
Related: #454 |
GitLab's CI also supports autoscaling via Docker engine, which allows it to automatically provision, start, and destroy AWS, DO, GCE, ... boxes. It'd be great to be able to do something similar. |
Related: #493 |
I would also be very interested in something along these lines, especially if it were possible to customize the backend runner. Our particular use case is hooking it up to an HPC cluster scheduler such as Slurm: ideally we would have a single daemon running on a login node, and each action would schedule a job to run on the cluster (without blocking additional actions from being scheduled). The US Department of Energy is building something similar with GitLab CI: https://ecp-ci.gitlab.io/ |
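A very rough sketch of that Slurm idea: a daemon on the login node submits one batch job per queued workflow job, and the runner registers, runs a single job (via the runner's `--ephemeral` flag), and unregisters inside the allocation. Every name, path, and time limit here is hypothetical:

```bash
# Submit one batch job per pending workflow job; $REG_TOKEN is a
# registration token the daemon minted beforehand via the API.
sbatch --job-name=gh-runner --time=01:00:00 --wrap "
  cd \$HOME/actions-runner
  ./config.sh --url https://github.com/OWNER/REPO \
              --token $REG_TOKEN --ephemeral --unattended
  ./run.sh
"
```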
#454 is fixed - and so is evryfs/github-actions-runner-operator#2 - which implements elastic build pools. |
Hi guys, if I understand how GitHub Actions works, when there is no runner registered for a repo, any job fails. Jobs are only queued/pending when there are runners that are busy/offline. Which means there is no way for a system to detect that a job needs a runner, which makes on-demand runners almost impossible unless there is always a configured runner for each repo, and that does not scale very well as the number of repos increases. It seems to be a design choice to fail all jobs when there is no worker. A workaround could be to register one "dead" worker for each repo... but it seems that the registration process is not clearly defined in the documentation (https://developer.github.com/v3/actions/self-hosted-runners/), or I did not find it. Can you share the API calls required to register a runner? |
@renaudhager you can find the API calls here: https://docs.github.com/en/free-pro-team@latest/rest/reference/actions#self-hosted-runners - there needs to be a match on the labels in order for the job to be queued; if there are no runners with matching labels, the job fails. |
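For reference, a sketch of the registration flow from those docs: a PAT with repo scope mints a short-lived registration token, which `config.sh` then consumes. `OWNER/REPO`, `$GH_PAT`, and the label are placeholders:

```bash
# Mint a registration token via the documented REST endpoint.
REG_TOKEN=$(curl -s -X POST \
  -H "Authorization: token $GH_PAT" \
  -H "Accept: application/vnd.github.v3+json" \
  "https://api.github.com/repos/OWNER/REPO/actions/runners/registration-token" \
  | jq -r .token)

# Configure the runner non-interactively with that token.
./config.sh --url "https://github.com/OWNER/REPO" \
            --token "$REG_TOKEN" \
            --labels my-label \
            --unattended
```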
Thank you.
The docs describe what API calls need to be made to delete a self-hosted runner, but the actual call made by the config script is not described, or I did not find it. Also, is there any specific reason to fail the job when there is no runner matching the labels? It could be queued and wait. |
@renaudhager You can have a look at https://github.com/evryfs/github-actions-runner/blob/master/entrypoint.sh#L17, which is a script that registers the runner on startup and removes it again on shutdown using those API calls; a sketch of that flow follows.
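Continuing the registration sketch above, the teardown half of such an entrypoint might look like this. The remove-token endpoint is documented alongside the registration one; the variable names remain placeholders:

```bash
# Deregister on exit so stopped containers don't leave offline runners behind.
cleanup() {
  REMOVE_TOKEN=$(curl -s -X POST \
    -H "Authorization: token $GH_PAT" \
    "https://api.github.com/repos/OWNER/REPO/actions/runners/remove-token" \
    | jq -r .token)
  ./config.sh remove --token "$REMOVE_TOKEN"
}
trap cleanup EXIT

./run.sh
```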
Why github has chosen to design this as they have I do not know - I am merely a humble user... |
Thanks for your help @davidkarlsen.
As we can see, the actual registration call is not described there. That is why I asked — I can't find anything related to the registration API call :-( |
For those who are interested, we are exploring options to support autoscaling with KEDA. Please head over to kedacore/keda#1732 and give a 👍. |
Yes! We're very interested in this -- and please don't take me closing this issue as anything but. 😁 This issue is a bit old, so I wanted to give my current thoughts:

The runner itself (meaning: the application in this repository) will not do any autoscaling; instead, compute will be scaled up and the runner (again, meaning the application in this repository) will take a job and run it. We've made some changes to the runner to support this (like the `--ephemeral` flag).

I'm a big fan of Actions Runner Controller. We have people on the team who are contributing to it. If you want to scale containers in Kubernetes, this is what I'd recommend.

There are definitely improvements to be made all over - and we're making them - but I think this is not really a problem that is scoped to this repository; it's bigger than that. So I'm going to close this so that it's not being tracked in the runner team's issue board (in other words, this repository). If you have more feedback, the Actions feedback repository is the best place to discuss further. Thanks! |
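For context on the model described in that closing comment, a runner registered with `--ephemeral` takes exactly one job and then unregisters itself, which is the building block for "scale compute up, run a job, tear it down" (placeholders as before):

```bash
# Register an ephemeral runner: it accepts a single job, then removes itself.
./config.sh --url "https://github.com/OWNER/REPO" \
            --token "$REG_TOKEN" \
            --ephemeral \
            --unattended
./run.sh   # exits after the single job completes
```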
Sorry if this isn't the right repo!
Describe the enhancement
Support for elastic self-hosted pools. The biggest downside of using self-hosted runners over the GitHub-hosted agents is the lack of scalability (and the loss of a clean state on each run).
It'd be great if you could, through some mechanism, allow the ability to scale up and down self-hosted agents.
Azure DevOps Pipelines is currently exploring a way of doing this, though their method takes advantage of having full access to a customer's Azure subscription, where they can do the heavy lifting of setting up scale sets and automating them.
Another nice option, which would likely require a heavy lift, would be native support for Kubernetes. Possibly the biggest issue there would be making Actions' native container support play well, but this would allow Actions to take advantage of Kubernetes' native scaling support via things like HPA/VPA and the Cluster Autoscaler. It could, in theory, also be done generically so you wouldn't have to build out a provider-specific solution. Prior art here is GitLab and Jenkins.