This project leverages Terraform and Packer to deploy and maintain a scalable self-hosted GitHub Actions runner infrastructure on GCP for a GitHub organization.
- Auto-scaling of runners, supporting simple and powerful scaling policy configuration 🧰
- Fast scale up thanks to a prebuilt image via Packer 🚄
- Supports Docker out of the box 🏗️
- Cost efficient versus Linux GitHub hosted runners 💰
- Terraform / Packer project for scalable self-hosted GitHub Actions runners on GCP 🚀
To set up the infrastructure, here are the manual steps which cannot be done through Terraform:
- Set up a GitHub App which will be installed in the GitHub organization where self-hosted runners will be available. The GitHub setup section explains how to set up this GitHub App
- Set up a GCP project which will host the self-hosted runners and the scaling logic. The GCP setup section explains how to set up this GCP project
- Deploy the infrastructure on GCP using Packer and Terraform
- Update the GitHub App webhook settings with GCP project information computed during the GCP deployment
This section explains how to set up the GitHub App in the GitHub organization where we want to use self-hosted runners. The GitHub App will allow GCP to communicate with the GitHub organization in order to create/delete runners and scale up/down when needed. Check the components scheme for more information. To be able to properly set up the GitHub App in the GitHub organization, you need to be an admin of this GitHub organization.
- Create a GitHub App:
  - In `Webhook`, uncheck `Active`, and leave `Webhook URL` empty for now. It will be enabled in part 4 of the setup.
  - In `Repository permissions`, grant `Read-only` permission to `Checks` and `Metadata` (needed to forward scale up events from GitHub towards GCP)
  - In `Organization permissions`, grant `Read & write` to `Self-hosted runners`
  - Check `Any account` for `Where can this GitHub App be installed?`
- From the GitHub App (`https://github.com/settings/apps/{your-app-name}`):
  - Create and store a `client secret`
  - Create and store a `private key`
  - Store the `app id` and the `client id`
  - Generate, set and store the `Webhook secret` (one way to generate it is sketched after this list). It will be needed for part 4 of the setup.
- Install the GitHub App in your GitHub organization using `https://github.com/settings/apps/{your-app-name}/installations`. You will then land on the installed GitHub App web page (the url should look like `https://github.com/settings/installations/{installation_id}`). Store the `installation_id`.
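If you need a quick way to generate a random value for the `Webhook secret`, one possibility (not mandated by the project, any sufficiently random string works) is:

```bash
# Generate a 32-byte random hex string to use as the GitHub App webhook secret
openssl rand -hex 32
```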
To address a specific GCP project:
- Create a GCP project
- Create a GCP service account for this project with the `owner` role
- Download the JSON key file associated with the service account. You will set it as the Terraform variable `google.credentials_json_b64`
- Enable Compute Engine API on the project
- Enable Identity and Access Management (IAM) API on the project
- Create an App Engine application at `https://console.cloud.google.com/appengine/start?project={your-gcp-project}` with the zone you want to use in your GCP project
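The list above can also be scripted with the gcloud CLI. A minimal sketch, assuming a project named `my-runners-project` and a service account named `terraform` (both hypothetical names):

```bash
# Create the GCP project (hypothetical project ID)
gcloud projects create my-runners-project

# Create a service account for Terraform and grant it the owner role
gcloud iam service-accounts create terraform --project my-runners-project
gcloud projects add-iam-policy-binding my-runners-project \
  --member "serviceAccount:terraform@my-runners-project.iam.gserviceaccount.com" \
  --role "roles/owner"

# Download the JSON key used for the google.credentials_json_b64 Terraform variable
gcloud iam service-accounts keys create key.json \
  --iam-account terraform@my-runners-project.iam.gserviceaccount.com

# Enable the required APIs
gcloud services enable compute.googleapis.com iam.googleapis.com --project my-runners-project

# Create the App Engine application in the region you want to use (hypothetical region)
gcloud app create --project my-runners-project --region europe-west1
```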
From now on, you have everything needed to deploy the infrastructure on GCP using Terraform and Packer. For Terraform, you can refer to the inputs section to set the project input variables. For Packer, you can refer to the Packer JSON config located at runner.json.
Nevertheless, deploying the infrastructure can be a tedious task, as two tools are involved (Terraform and Packer) and they share a lot of parameters.
To ease deployment, a set of scripts is available in the tools folder. From there, `deploy.sh` and `destroy.sh` do as they are named, deploying and destroying the whole GCP infrastructure easily. To use these two scripts you need to store your Terraform variables in files with a specific layout. The scripts will load these variables and use them in Terraform and Packer. The layout to use is:
- A `google` tfvars file containing GCP related Terraform variables (see the note after this list for encoding `credentials_json_b64`):
{
  "google": {
    "region" : "a-gcp-region",
    ...
  },
  "runner": {
    "total_count" : 10,
    ...
  },
  "scaling": {
    "idle_count" : 2,
    ...
  }
}
- A `github` tfvars file containing GitHub related Terraform variables:
{
  "github": {
    "organisation" : "a-github-organization"
    ...
  }
}
- A `terraform backend` tfvars file, passed to Terraform via the `-backend-config` option
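The `google.credentials_json_b64` value expected in the `google` tfvars file is the base64-encoded content of the service account key file downloaded during the GCP setup. A minimal sketch, assuming the key file is named `key.json`:

```bash
# Encode the service account key as a single-line base64 string
# (-w 0 disables line wrapping; it is a GNU coreutils flag, macOS base64 does not wrap by default)
base64 -w 0 key.json
```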
From now on, you can use the deploy script. After deployment, Terraform will print the `github_webhook_url` needed for part 4 of the setup.
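For example, with the three variable files described above stored as `google.json`, `github.json` and `backend.json` (hypothetical file names), a full deployment could look like:

```bash
# Deploy the Packer image and the Terraform infrastructure in one go
.tools/deploy.sh \
  --google-env-file google.json \
  --github-env-file github.json \
  --backend-config-file backend.json
```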
- Enable the GitHub webhook, from the GitHub App (`https://github.com/settings/apps/{your-app}`):
  - In `Webhook`, check `Active`, and set `Webhook URL` to the `github_webhook_url` from part 3 of the setup
  - Go to `https://github.com/settings/apps/{your-app}/permissions`, in `Subscribe to events`, check `Check run`
- Enable the ghost runner (see the architecture section for more info about the ghost runner):
  - From `https://console.cloud.google.com/cloudscheduler?project={your-gcp-project}`, execute the `healthcheck` job (it can also be run from the CLI, see the sketch after this list)
  - Wait for the ghost runner to appear as `offline` in `https://github.com/organizations/{your-org}/settings/actions`
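The `healthcheck` job can also be run from the command line instead of the web console. A sketch, assuming the Cloud Scheduler job created by Terraform is indeed named `healthcheck`:

```bash
# Manually trigger the healthcheck job so the ghost runner gets registered
gcloud scheduler jobs run healthcheck --project {your-gcp-project}
```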
You are all set!
The project is made of a Terraform project and a Packer project. Packer is used to build and deploy the base GCE image for the self-hosted runners. Terraform is used to deploy the components needed to manage the self-hosted runners. After deployment, the project behavior can be monitored from Google Cloud Monitoring. Using Cloud Scheduler, you can manually trigger health checks, runner renewal and scale down. Manually deleting a VM will automatically unregister it from GitHub (it won't appear offline from a GitHub perspective).
Packer needs to be deployed before the Terraform project, as the latter uses the image built and deployed by the former. Packer deployment can be done using:
- Deploy script (recommended)
- Packer CLI
- packer.sh helper script
usage: image/packer.sh [--env-file google-env-file.json] [--packer-action build]
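For instance, to build the image with the helper script and a google env file named `google.json` (hypothetical name):

```bash
image/packer.sh --env-file google.json --packer-action build
```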
Terraform deployment can be done using:
- Deploy script (recommended)
- Terraform CLI
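If you prefer the Terraform CLI over the deploy script, a minimal sketch, assuming variable files named `google.tfvars.json` and `github.tfvars.json` and a backend configuration file named `backend.json` (all hypothetical names):

```bash
# Initialize Terraform with the remote backend configuration
terraform init -backend-config=backend.json

# Apply using the two variable files described in the deployment section
terraform apply -var-file=google.tfvars.json -var-file=github.tfvars.json
```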
The deploy.sh script is made to ease deployment of the whole system (Packer and Terraform). It ensures the Packer image is deployed before triggering Terraform and is able to update the Packer image if wanted.
usage: .tools/deploy.sh { --google-env-file google-env-file.json --github-env-file github-env-file.json --backend-config-file backend.json } [ --skip-packer-deploy ] [ --skip-terraform-deploy ] [ --auto-approve ]
The image generated by Packer has to be destroyed manually, directly in the GCP project. It can be done using the gcloud CLI or the GCP web interface (see the sketch after the usage below). The infrastructure deployed by Terraform can be destroyed using:
- Destroy script (recommended)
usage: .tools/destroy.sh { --google-env-file google-env-file.json --github-env-file github-env-file.json --backend-config-file backend.json }
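To remove the Packer image with the gcloud CLI, something along these lines works (the image name is hypothetical; check which image the Packer build actually produced in your project):

```bash
# List the images of the project and delete the one produced by Packer
gcloud compute images list --project {your-gcp-project}
gcloud compute images delete IMAGE_NAME --project {your-gcp-project}
```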
The main branch is considered stable and releases will be made from it.
The project is made of two systems, a GCP project and a GitHub App, communicating together to manage and scale a pool of GitHub self-hosted runners (GCP VMs) used by the GitHub organization where the GitHub App is installed. To implement scalability, the GitHub App forwards check_run
events from the organization repositories to the GCP project, where they are received by the github-hook component. This component then filters the events and may choose to forward them to the start-and-stop component, responsible for managing and scaling the runners (VMs). The start-and-stop component may then choose to scale up the runner pool, according to its scale up policy. At the moment, the GitHub API does not produce an event signaling the end of a GitHub Action run, so to implement the scale down, the start-and-stop component is triggered regularly through a cloud scheduler to check whether runners can be scaled down, using its scale down policy. Communication from the GCP project to GitHub goes through the github-api proxy component, allowing multiple components of the GCP project to address the GitHub API. Authentication between GCP and GitHub is made through secrets stored in the secrets component. No sensitive data is leaked through cloud function environment variables, for instance. Last but not least, the start-and-stop component has a retry behavior and follows the idempotency principle. It is also regularly scheduled to execute health checks on the whole system.
The Packer project is located under the ./image folder. The Terraform project is made of a root module located at the root of the project and of child modules located under the modules folder. The Terraform version is set through a specific version constraint.
GitHub Action allows the usage of self-hosted runners. This project deploys an infrastructure in GCP which is able to manage self-hosted runners connected to a GitHub organization. This infrastructure is able to scale the number of runners according to the needs of the organization repositories in terms of GitHub Action usage. This is done by setting up a GitHub App at the organization level, which is allowed to manage the organization runners and receives various GitHub events from the organization's repositories.
Located under the start-and-stop Terraform child module. It is made of an event-based (Pub/Sub) cloud function and a cloud scheduler. The function is responsible for applying core operations on runners (compute instances). It handles scaling operations, healthchecks, and runner renewal. It is triggered by external events coming from github-hook or the cloud scheduler.
Located under the github-hook Terraform child module. An HTTP cloud function receiving events from the GitHub App, to detect when a scale up should be evaluated.
Located under the github-api Terraform child module. An HTTP cloud function used as a proxy for GCP components to call the GitHub API. It is used by start-and-stop to monitor runners from a GitHub perspective, and by the runners themselves to register/unregister as self-hosted runners with the GitHub organization when starting/stopping.
Located under the secrets Terraform child module. All sensitive information used by the different components in the GCP project is stored and retrieved using Secret Manager, in particular sensitive data for authentication with the GitHub API.
The GitHub runners are VMs managed by the start-and-stop component. These VMs automatically register themselves as runners with GitHub at start and unregister at stop.
Located under image. The base runner image is built and deployed using Packer. The image is made of the software stack needed by the runner (Docker, Stackdriver, GitHub Action runner). You can customize the base software installed by modifying the init.sh script. Thanks to this base image, the amount of time between the moment the system triggers a scale up and the moment the resulting GitHub runner is available on GitHub is reduced (it only needs to create and start a VM from this base image).
The key feature of this system is its ability to scale up and down according to the needs of your GitHub organization in terms of GitHub Action usage.
Scale up is triggered by `check_run` events with the status `queued`. This event is emitted when a GitHub Action workflow is triggered (a workflow run). When receiving this event, the start-and-stop component compares the count of inactive runners with `scaling.up_rate`. If this count is lower than the rate, the component scales up the number of runners to match `scaling.up_rate`. At the moment, the GitHub webhook does not expose the number of runners a workflow run could need. For instance, in case of a matrix, the workflow run could be parallelized between multiple runners. In consequence, depending on your usage of GitHub Action, you may want to use a `scaling.up_rate` higher than one.
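As an illustration only (the actual logic lives in the start-and-stop Node.js cloud function), the scale up decision roughly amounts to the following; the capping by `scaling.up_max` is an assumption based on its description in the inputs table:

```bash
# Rough sketch of the scale up policy, with hypothetical values
up_rate=3      # scaling.up_rate
up_max=10      # scaling.up_max (assumed to cap the total pool size)
inactive=1     # runners currently registered but not busy
total=4        # runners currently alive

if [ "$inactive" -lt "$up_rate" ]; then
  need=$(( up_rate - inactive ))   # runners missing to reach the up rate
  room=$(( up_max - total ))       # room left before reaching the maximum
  to_create=$(( need < room ? need : room ))
  echo "scaling up: creating $to_create runner(s)"
fi
```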
The GitHub API does not expose a webhook event when a GitHub Action workflow run ends. To trigger the scaling down of the runners, a cloud scheduler event is used, triggered by the cron `scaling.down_schedule`. When receiving this event, the start-and-stop component counts the number of inactive runners and tries to scale them down by `scaling.down_rate`. The number of runners scaled down may be lower depending on the idle runner policy.

By default, scale down is triggered every 10 minutes (terraform.tfvars.json). You can manually trigger it at https://console.cloud.google.com/cloudscheduler?project=your-gcp-project.
The idle runner feature prevents a certain count (`scaling.idle_count`) of inactive runners from being scaled down during a given period of time defined by the cron `scaling.idle_schedule`. The `idle_schedule` follows the cron syntax and is evaluated with the timezone provided by `google.time_zone`.

For instance, `* 8-18 * * 1-5` applies the idle runner policy from 8:00 to 18:59, Monday to Friday.
GitHub Action will fail a workflow if it cannot find a suitable runner (a runner matching the [`runs-on`](https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#jobsjob_idruns-on) defined by the workflow). A suitable runner always needs to be registered to GitHub, but it does not need to be actually online. This is why the system sets up a special runner, the ghost runner, which is registered to GitHub with the labels used by the system but does not actually exist. With the ghost runner, the system is able to scale down to 0 runners.
The system is made with stability and resilience in mind. The start-and-stop component respects the idempotency principle and is able to retry itself in case of error.
The Stackdriver agent is enabled on the VMs, allowing them to be precisely monitored (resource usage, logs, errors).
Health checks are regularly triggered on the system by the cron `triggers.healthcheck_schedule`, allowing it to recover from major errors. For instance, the healthchecks are able to detect and fix:
- Missing ghost runner
- Offline runners on GitHub, probably runners which encountered a major failure during a workflow.
- Unknown runners on GitHub, probably runners which encountered a major failure during their startup sequence.
By default, those checks are triggered once a day (terraform.tfvars.json). You can manually trigger them at https://console.cloud.google.com/cloudscheduler?project=your-gcp-project.
Runner renewal is regularly triggered on the system by the cron `triggers.renew_schedule` to ensure runners do not stay alive too long (the longer a runner stays alive, the higher the chance it fails).

By default, renewal is triggered once a day (terraform.tfvars.json). You can manually trigger it at https://console.cloud.google.com/cloudscheduler?project=your-gcp-project.
To compute the cost of the system, all the resources used by the GCP project must be taken into account. In our case we have:
- Compute for the VMs
- Network for the VMs
- Storage for the VMs
- Cloud function
- Secret manager for managing the secrets
The overall cost can be simplified to the Compute cost, given that:
- Network traffic would be essentially ingress, which is free.
- Storage only serves for cloud function source code, the base runner image, and the disks used by the runners, which is negligible vs the Compute cost.
- The Cloud Functions' computation needs are very low, so we are using the minimum resources possible. This is negligible vs the Compute cost.
- Secret Manager usage is fairly limited to the cloud functions and VMs, which is negligible vs the Compute cost.
The only cost which would not be negligible vs the Compute cost is if your GitHub Action workflows imply a lot of egress traffic, like pushing heavy artifacts outside of GCP. To express when this solution is cost efficient vs GitHub Action hosted runners, we can use the following inequation, with:

- `W` the number of minutes of CI per month
- `O` the number of minutes where VMs do nothing per month
- `Cgcp` the cost per minute of an N1 GCP VM, 0.0016$ (the same amount of performance as a GitHub hosted runner)
- `Cgithub` the cost per minute of a GitHub hosted runner (0.008$)

`(W + O) × Cgcp < W × Cgithub`

Which can be simplified to:

`O < 4 × W`

The system is cost efficient if the time spent inactive by the runners is less than 4 times the time spent executing workflows.
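Spelled out with the per-minute costs quoted above (a reconstruction of the arithmetic, not a quote from the original figures):

```latex
(W + O)\,C_{gcp} < W\,C_{github}
\iff O < W \cdot \frac{C_{github} - C_{gcp}}{C_{gcp}}
       = W \cdot \frac{0.008 - 0.0016}{0.0016}
       = 4W
```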
- Bash (automation)
- Terraform and Packer (deployment of the infrastructure)
- Docker (run the CI locally)
- Node.js (the cloud function environment)
- `Runner`: Usually a virtual machine used by the GitHub Action API to schedule workflow runs, through the GitHub Action software
- `GitHub Action Hosted Runner`: The runners provided by GitHub, free for open source projects and billed for private projects
- `The system`: Represents the solution as a whole proposed by this project (GCP project and GitHub App)
- `Component`: Represents a dedicated entity in GCP responsible for doing one thing. Usually can be mapped to a Terraform child module
- `Workflow`: A GitHub Action automated process to host CI/CD logic
- `Inactive Runner`: A VM started and not executing any workflow. Also named non busy in some parts of the code
- `Idle Runner`: A runner which cannot be scaled down because of the idle policy
All crons are evaluated with the timezone `google.time_zone`.
We are using the following cron syntax:
┌───────────── minute (0 - 59)
│ ┌───────────── hour (0 - 23)
│ │ ┌───────────── day of the month (1 - 31)
│ │ │ ┌───────────── month (1 - 12 or JAN-DEC)
│ │ │ │ ┌───────────── day of the week (0 - 6 or SUN-SAT)
│ │ │ │ │
│ │ │ │ │
│ │ │ │ │
* * * * *
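For example, the default scale down cadence mentioned above (every 10 minutes) can be expressed with this syntax as:

```
*/10 * * * *
```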
Name | Version |
---|---|
terraform | >= 0.13 |
Name | Version |
---|---|
n/a |
Name | Description | Type | Default | Required |
---|---|---|---|---|
github | Represents the GitHub App installed in the GitHub organization where the runners from GCP will serve as self-hosted runners. The GitHub App allows communication between the GitHub organization and the GCP project. Check GitHub setup. `organisation`: The GitHub organization (the name, for instance fabernovel for FABERNOVEL) where runners will be available. `app_id`: The id of the GitHub App. Available at https://github.com/organizations/{org}/settings/apps/{app}. `app_installation_id`: The installation id of the GitHub App within the organization. Available at https://github.com/organizations/{org}/settings/installations, `Configure`, app_installation_id is then in the url as https://github.com/organizations/faberNovel/settings/installations/{app_installation_id}. `client_id`: The client id of the GitHub App. Manageable at https://github.com/organizations/{org}/settings/apps/{app}. `client_secret`: The client secret of the GitHub App. Manageable at https://github.com/organizations/{org}/settings/apps/{app}. `key_pem_b64`: The private key of the GitHub App. Manageable at https://github.com/organizations/{org}/settings/apps/{app}. `webhook_secret`: The webhook secret of the GitHub App. Manageable at https://github.com/organizations/{org}/settings/apps/{app}. | `object({...})` | n/a | yes |
google | Represents the GCP project hosting the virtual machines acting as GitHub Action self-hosted runners. Check GCP setup. `project`: The project ID of the GCP project. `region`: The region of the GCP project. `zone`: The zone of the GCP project. `credentials_json_b64`: The content in b64 of the service account keys used by Terraform to manipulate the GCP project. `env`: A label used to tag GCP resources. `taint_labels` uses this value to taint runner labels. `time_zone`: The time zone to use in the project as described by the TZ database. `idle_schedule`, `down_schedule`, `healthcheck_schedule`, `renew_schedule` are evaluated in this time zone. | `object({...})` | n/a | yes |
runner | `type`: The machine type of the runners, for instance n1-standard-2. `taint_labels`: Enable tainting runner labels, useful to not mix debug and prod runners for your organization. | `object({...})` | n/a | yes |
scaling | `idle_count`: The number of runners to keep idle. `idle_schedule`: A cron describing the idling period of runners. Syntax. `up_rate`: The number of runners to create when scaling up. `up_max`: The maximum number of runners. `down_rate`: The number of inactive runners to delete when scaling down. `down_schedule`: A cron to trigger regularly scaling down. Syntax. | `object({...})` | n/a | yes |
triggers | `healthcheck_schedule`: A cron to trigger health checks. Syntax. `renew_schedule`: A cron to trigger runners renewal. Syntax. | `object({...})` | n/a | yes |
Name | Description |
---|---|
github_webhook_url | n/a |
We love to hear your input, whether it's about reporting bugs, proposing fixes or new features, or general discussion about the system behavior.
To do so we are using GitHub, so all changes must be done using pull requests.
To report issues, use GitHub issues. Try to write bug reports with detail, background, and if applicable some code. Try to be specific and include steps to reproduce.
The code style used by the project is checked by the CI. To be sure you are respecting it, you can execute the CI locally using the `run-all.sh` script.
The project is licensed under the GNU General Public License v3.0.