faberNovel/terraform-gcp-github-runner

Terraform / Packer project for scalable self hosted GitHub action runners on GCP 🚀

This project leverages Terraform and Packer to deploy and maintain a scalable self hosted GitHub Action runner infrastructure on GCP for a GitHub organization.

Setup

To set up the infrastructure, here are the steps to follow, some of which are manual because they cannot be done through Terraform:

  1. Set up a GitHub App which will be installed in the GitHub organization where the self hosted runners will be available. The GitHub setup section explains how to set up this GitHub App
  2. Set up a GCP project which will host the self hosted runners and the scaling logic. The GCP setup section explains how to set up this GCP project
  3. Deploy the infrastructure on GCP using Packer and Terraform
  4. Update the GitHub App Webhook setting with the GCP project information computed during the GCP deployment

1-GitHub Setup

This section explains how to set up the GitHub App in the GitHub organization where we want to use self hosted runners. The GitHub App allows GCP to communicate with the GitHub organization in order to create and delete runners and to scale up and down when needed. Check the component scheme for more information. To properly set up the GitHub App in the GitHub organization, you need to be an admin of this GitHub organization.

  • Create a GitHub App:
    • In Webhook, uncheck Active, and leave Webhook URL empty for now. It will be enabled in part 4 of the setup.
    • In Repository permissions grant Read-only permission to Checks and Metadata (needed to forward scale up event from GitHub towards GCP)
    • In Organization permissions, grant Read & write to Self-hosted runners
    • Check Any account for Where can this GitHub App be installed?
  • From the GitHub App (https://github.com/settings/apps/{your-app-name}):
    • Create and store a client secret
    • Create and store a private key
    • Store the app id and the client id
    • Generate, set and store the Webhook secret. It will be needed for part 4 of the setup.
    • Install the GitHub App in your GitHub Organization using https://github.com/settings/apps/{your-app-name}/installations. You will then land on the installed GitHub App web page (the url should look like https://github.com/settings/installations/{installation_id}). Store the installation_id.
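
The private key stored above is later passed to Terraform through the key_pem_b64 input. Assuming the _b64 suffix means standard base64 without line wrapping (an assumption, not something this README states explicitly), a minimal Python sketch of the encoding could look like this. The helper names are hypothetical.

```python
import base64
from pathlib import Path


def encode_b64(raw: bytes) -> str:
    """Return raw bytes as a single-line base64 string."""
    return base64.b64encode(raw).decode("ascii")


def encode_file_b64(path: str) -> str:
    """Read a file (for instance the downloaded .pem key) and base64 encode it."""
    return encode_b64(Path(path).read_bytes())
```

For example, encode_file_b64("my-app.private-key.pem") would return the value to store, where the file name is a placeholder for the key downloaded from the GitHub App page.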

2-Google Cloud Platform Setup

To address a specific GCP project:

3-Deploy the infrastructure on GCP using Terraform/Packer

You now have everything needed to deploy the infrastructure on GCP using Terraform and Packer. For Terraform, refer to the inputs to set the project input variables. For Packer, refer to the Packer JSON config located at runner.json.

Nevertheless, deploying the infrastructure can be a tedious task, as two tools are involved (Terraform and Packer) and they share a lot of parameters. To ease deployment, a set of scripts is available in the tools folder. There, deploy.sh and destroy.sh do as they are named, easily deploying and destroying the whole GCP infrastructure. To use these two scripts, you need to store your Terraform variables in files with a specific layout. The scripts will load these variables and use them in Terraform and Packer. The layout to use is:

  • A google tfvars file containing GCP related Terraform variables:
{
  "google": {
    "region" : "a-gcp-region",
    ...
  },

  "runner": {
    "total_count" : 10,
    ...
  },

  "scaling": {
    "idle_count" : 2,
    ...
  }
}
  • A github tfvars file containing GitHub related Terraform variables:
{
  "github": {
    "organisation" : "a-github-organization",
    ...
  }
}
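
Before running the scripts, it can help to check that each file contains the expected sections and keys. The sketch below is only an illustration: the key names are copied from the Inputs section of this README, while the function names and the choice of which sections belong to which file are assumptions.

```python
import json

# Sections and keys as listed in the Inputs section of this README.
EXPECTED_KEYS = {
    "github": {"organisation", "app_id", "app_installation_id", "client_id",
               "client_secret", "key_pem_b64", "webhook_secret"},
    "google": {"project", "region", "zone", "credentials_json_b64",
               "env", "time_zone"},
    "runner": {"type", "taint_labels"},
    "scaling": {"idle_count", "idle_schedule", "up_rate", "up_max",
                "down_rate", "down_schedule"},
    "triggers": {"healthcheck_schedule", "renew_schedule"},
}


def missing_keys(variables: dict, sections: list[str]) -> list[str]:
    """List the "section.key" entries absent from a parsed tfvars file."""
    missing = []
    for section in sections:
        present = variables.get(section, {})
        for key in sorted(EXPECTED_KEYS[section]):
            if key not in present:
                missing.append(f"{section}.{key}")
    return missing


def check_file(path: str, sections: list[str]) -> list[str]:
    """Load a JSON tfvars file and report its missing keys."""
    with open(path, encoding="utf-8") as handle:
        return missing_keys(json.load(handle), sections)
```

For instance, check_file("github-env-file.json", ["github"]) returns an empty list when every GitHub variable is present.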

You can now use the deploy script.

After deployment, Terraform will print the github_webhook_url needed for part 4 of the setup.

4-Post deployment steps

  • Enable the GitHub webhook, from the GitHub App (https://github.com/settings/apps/{your-app}):
    • In Webhook, check Active, and set Webhook URL to github_webhook_url from part 3 of the setup
    • Go to https://github.com/settings/apps/{your-app}/permissions, in Subscribe to events, check Check run
  • Enable ghost runner (see architecture for more info about ghost runner):
    • From https://console.cloud.google.com/cloudscheduler?project={your-gcp-project}, execute healthcheck
    • Wait for the ghost runner to appear offline in https://github.com/organizations/{your-org}/settings/actions

You are all set!

Usage

The project is made of a Terraform project and a Packer project. Packer is used to deploy the self hosted runner base GCE image. Terraform is used to deploy the components needed to manage the self hosted runners. After deployment, the project behavior can be monitored from Google Cloud Monitoring. Using Cloud Scheduler, you can manually trigger health checks, runner renewal and scale down. Manually deleting a VM will automatically unregister it from GitHub (it won't appear offline from the GitHub perspective).

Deploy

Packer needs to be deployed before the Terraform project, as the latter uses the image built and deployed by the former. Packer deployment can be done using:

usage: image/packer.sh [--env-file google-env-file.json] [--packer-action build]

Terraform deployment can be done using:

Deploy script

The deploy.sh script is made to ease deployment of the whole system (Packer and Terraform). It ensures the Packer image is deployed before triggering Terraform and is able to update the Packer image if wanted.

usage: ./tools/deploy.sh { --google-env-file google-env-file.json --github-env-file github-env-file.json --backend-config-file backend.json } [ --skip-packer-deploy ] [ --skip-terraform-deploy ] [ --auto-approve ]

Destroy

The image generated by Packer has to be destroyed manually, directly in the GCP project. It can be done using the gcloud CLI or the GCP web interface. The infrastructure deployed by Terraform can be destroyed using:

usage: ./tools/destroy.sh { --google-env-file google-env-file.json --github-env-file github-env-file.json --backend-config-file backend.json }

Releases

main branch is considered stable and releases will be made from it.

Architecture

The project is made of two systems, a GCP project and a GitHub App, communicating together to manage and scale a GitHub self hosted runner pool (GCP VMs) used by the GitHub organization where the GitHub App is installed.

To implement scalability, the GitHub App forwards check_run events from the organization repositories to the GCP project, where they are received by the GitHub hook component. This component then filters the events and may choose to forward them to the start-and-stop component, responsible for managing and scaling the runners (VMs). The start-and-stop component may then choose to scale up the runner pool, according to its scale up policy. At the moment, the GitHub API does not produce an event to signal the end of a GitHub Action run, so to implement the scale down, the start-and-stop component is triggered regularly through a cloud scheduler to check if runners can be scaled down, using its scale down policy.

Communication from the GCP project to GitHub goes through the GitHub API proxy component, allowing multiple components of the GCP project to address the GitHub API. Authentication between GCP and GitHub relies on secrets stored in the secret manager component. No sensitive data is leaked through cloud function environment variables, for instance.

Last but not least, the start-and-stop component has a retry behavior and follows the idempotency principle. It is also regularly scheduled to execute health checks on the whole system.

Project layout

The Packer project is located under the ./image folder. The Terraform project is made of a root module located at the root of the project and of child modules located under the modules folder. The Terraform version is pinned to a specific version.

How it works

GitHub Action allows the usage of self hosted runners. This project deploys an infrastructure in GCP which is able to manage self hosted runners connected to a GitHub organization. This infrastructure is able to scale the number of runners according to the needs of the organization repositories in terms of GitHub Action. This is done by setting up a GitHub App at the organization level, which is allowed to manage the organization runners and receives various GitHub events from the organization repositories.

Component scheme

(Figure: architecture diagram of the components)

Components

Start and Stop

Located under the start-and-stop Terraform child module. It is made of an event based (Pub/Sub) cloud function and a cloud scheduler. The function is responsible for applying core operations on runners (compute instances). It handles scaling operations, health checks, and runner renewal. It is triggered by external events coming from github-hook or the cloud scheduler.

GitHub hook

Located under the github-hook Terraform child module. An HTTP cloud function receiving events from the GitHub App, to detect when a scale up should be evaluated.
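
As an illustration of what such a hook has to do, here is a minimal Python sketch (the project's own functions run on NodeJs, so this is not the actual implementation). It checks the X-Hub-Signature-256 header that GitHub computes from the Webhook secret, then keeps only check_run events with the queued status. The function names are hypothetical.

```python
import hashlib
import hmac


def is_valid_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Check the X-Hub-Signature-256 header sent by GitHub with a delivery."""
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest("sha256=" + digest, signature_header)


def should_evaluate_scale_up(event_name: str, payload: dict) -> bool:
    """Keep only check_run events whose check run is still queued.

    event_name is the value of the X-GitHub-Event header.
    """
    if event_name != "check_run":
        return False
    return payload.get("check_run", {}).get("status") == "queued"
```

A delivery would only be forwarded to start-and-stop when both functions return True.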

GitHub API

Located under the github-api Terraform child module. An HTTP cloud function used as a proxy for GCP components to call the GitHub API. It is used by start-and-stop to monitor runners from a GitHub perspective, and by the runners themselves to register and unregister as self hosted runners of the GitHub organization when starting and stopping.

Secret Manager

Located under the secrets Terraform child module. All sensitive information used by the different components in the GCP project is stored and retrieved using Secret Manager, in particular the sensitive data for authentication with the GitHub API.

GitHub Runners

The GitHub Runners are VMs managed by the start-and-stop component. These VMs automatically register as runners to GitHub at start and unregister at stop.

Base Runner Image

Located under image. The base runner image is built and deployed using Packer. The image is made of the software stack needed by the runner (Docker, StackDriver, GitHub Action Runner). You can customize the base software installed by modifying the init.sh script. Thanks to this base image, the time between the moment the system triggers a scale up and the moment the resulting GitHub Runner is available on GitHub is reduced (the system only needs to create and start the VM from this base image).

Scaling

The key feature of this system is its ability to scale up and down according to the needs of your GitHub organization in terms of GitHub Action.

Scale Up

Scale up is triggered by check_run events with the status queued. This event is emitted when a GitHub Action workflow is triggered (a workflow run). When receiving this event, the start-and-stop component compares the count of inactive runners with scaling.up_rate. If this count is lower than the rate, the component scales up the number of runners to match scaling.up_rate. At the moment, the GitHub webhook does not expose the number of runners a workflow run could need. For instance, in the case of a matrix, the workflow run could be parallelized between multiple runners. Consequently, depending on your usage of GitHub Action, you may want to use a scaling.up_rate higher than one.
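
The scale up rule described above can be sketched in Python as follows. This is a simplified model, not the project's code: the function name is hypothetical, and capping the result with scaling.up_max (the maximum number of runners from the Inputs section) is an assumption about how the two settings interact.

```python
def runners_to_create(inactive_count: int, total_count: int,
                      up_rate: int, up_max: int) -> int:
    """Number of runners to start after a queued check_run event."""
    if inactive_count >= up_rate:
        # Enough inactive runners are already waiting for work.
        return 0
    wanted = up_rate - inactive_count
    # Assumption: never exceed up_max runners in total.
    room = max(up_max - total_count, 0)
    return min(wanted, room)
```

With up_rate set to 3, one inactive runner and four runners in total, the sketch asks for two more runners.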

Scale Down

The GitHub API does not expose a webhook event when a GitHub Action workflow run ends. To trigger the scale down of the runners, a cloud scheduler event is used, triggered by the cron scaling.down_schedule. When receiving this event, the start-and-stop component counts the number of inactive runners and tries to scale them down by scaling.down_rate. The number of runners scaled down may be lower depending on the idle runner policy. By default, scale down is triggered every 10 minutes (see terraform.tfvars.json). You can manually trigger it at https://console.cloud.google.com/cloudscheduler?project=your-gcp-project.
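
A minimal Python sketch of this scale down rule is given below. The function and parameter names are hypothetical, and the way the idle runner policy reduces the result (by protecting scaling.idle_count inactive runners while the policy is active) is an interpretation of this README rather than the actual implementation.

```python
def runners_to_delete(inactive_count: int, down_rate: int,
                      idle_count: int, idle_policy_active: bool) -> int:
    """Number of inactive runners to delete on a scale down event."""
    # While the idle policy applies, idle_count inactive runners are kept.
    protected = idle_count if idle_policy_active else 0
    deletable = max(inactive_count - protected, 0)
    return min(down_rate, deletable)
```

With three inactive runners, a down_rate of 2 and an idle_count of 2, only one runner is deleted while the idle policy is active.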

Idle Runner

The idle runner feature prevents a certain count (scaling.idle_count) of inactive runners from being scaled down during a given period of time defined by the cron scaling.idle_schedule. The idle_schedule follows the cron syntax and is evaluated in the time zone provided by google.time_zone. For instance, * 8-18 * * 1-5 applies the idle runner policy from 8:00 to 18:59, from Monday to Friday.
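
To make the schedule semantics concrete, here is a small Python sketch that tests whether a given moment falls inside an idle_schedule. It is deliberately minimal and rests on assumptions: it only supports "*", single numbers, ranges and comma lists (no steps, no month or day names), it requires all five fields to match, and the function names are hypothetical.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def field_matches(field: str, value: int) -> bool:
    """Match one cron field made of "*", "n", "a-b" or comma lists."""
    for part in field.split(","):
        if part == "*":
            return True
        if "-" in part:
            low, high = part.split("-")
            if int(low) <= value <= int(high):
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expression: str, moment: datetime) -> bool:
    """Tell whether a local datetime matches a five-field cron expression."""
    minute, hour, day, month, weekday = expression.split()
    return (
        field_matches(minute, moment.minute)
        and field_matches(hour, moment.hour)
        and field_matches(day, moment.day)
        and field_matches(month, moment.month)
        # cron counts Sunday as 0 while isoweekday() counts it as 7.
        and field_matches(weekday, moment.isoweekday() % 7)
    )


def idle_policy_active(idle_schedule: str, time_zone: str,
                       now: datetime) -> bool:
    """Evaluate the schedule in the configured time zone (now must be aware)."""
    return cron_matches(idle_schedule, now.astimezone(ZoneInfo(time_zone)))
```

For example, cron_matches("* 8-18 * * 1-5", datetime(2024, 1, 8, 18, 59)) is True because that date is a Monday, while the same call one minute later, at 19:00, is False.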

Ghost runner

GitHub Action will fail a workflow if it cannot find a suitable runner, meaning a runner matching the runs-on labels defined by the workflow (see https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#jobsjob_idruns-on). A suitable runner always needs to be registered to GitHub, but it does not need to be actually online. This is why the system sets up a special runner, a ghost runner, which is registered to GitHub with the labels used by the system but does not actually exist. With the ghost runner, the system is able to scale down to 0 runners.
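
The reasoning behind the ghost runner can be modelled with a few lines of Python. This is only a model of the behavior described above, not GitHub's scheduler: the Runner type, the function name and the labels used in the example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runner:
    name: str
    labels: frozenset
    online: bool


def has_suitable_runner(runs_on: set, registered: list) -> bool:
    """True if a registered runner carries every requested label.

    The online flag is deliberately ignored: being registered is enough
    for the workflow to wait in the queue instead of failing.
    """
    return any(set(runs_on) <= runner.labels for runner in registered)
```

With a single offline Runner("ghost", frozenset({"self-hosted", "gcp"}), online=False) registered, a job requesting those two labels still finds a suitable runner, which leaves time for a real runner to be started.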

System health

The system is made with stability and resilience in mind. start-and-stop respects the idempotency principle and is able to retry in case of error.

Monitoring

The StackDriver agent is enabled on the VMs, allowing them to be precisely monitored (resource usage, logs, errors).

Health checks

Health checks are regularly triggered on the system by the cron triggers.healthcheck_schedule, allowing it to recover from major errors. For instance, the health checks are able to detect and fix:

  • Missing ghost runner
  • Offline runner on GitHub, probably a runner which encountered a major failure during a workflow.
  • Unknown runner on GitHub, probably a runner which encountered a major failure during its startup sequence.

By default those checks are triggered once a day (see terraform.tfvars.json). You can manually trigger them at https://console.cloud.google.com/cloudscheduler?project=your-gcp-project.
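
The three situations listed above can be illustrated by comparing the VMs present in GCP with the runners registered on GitHub. The Python sketch below is one possible reading, under assumptions: "unknown runner on GitHub" is interpreted as a VM that exists in GCP but is not registered on GitHub, runner statuses are the "online" and "offline" values returned by the GitHub API, and the function name is hypothetical.

```python
def diagnose(vm_names: set, github_runners: dict, ghost_name: str) -> dict:
    """Compare GCP VMs with GitHub runners.

    github_runners maps a runner name to its GitHub status.
    """
    # The ghost runner is expected to be offline, so it is left aside.
    real_runners = {name: status for name, status in github_runners.items()
                    if name != ghost_name}
    return {
        "missing_ghost": ghost_name not in github_runners,
        "offline_on_github": sorted(
            name for name, status in real_runners.items()
            if status == "offline"),
        "unknown_on_github": sorted(vm_names - set(real_runners)),
    }
```

Each non-empty entry of the result corresponds to one of the repairs a health check would have to perform.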

Runner renewal

Runner renewal is regularly triggered on the system by the cron triggers.renew_schedule, to ensure runners do not stay alive for too long (the longer a runner stays alive, the higher the chance it fails). By default, renewal is triggered once a day (see terraform.tfvars.json). You can manually trigger it at https://console.cloud.google.com/cloudscheduler?project=your-gcp-project.

Cost

To be able to compute the cost of the system, all the resources used by the GCP project must be taken into account. In our case we will have:

The cost can be reduced to the Compute cost alone, given that:

  • Network traffic would be essentially ingress, which is free.
  • Storage only serves for the cloud function source code, the base runner image, and the disks used by the runners, which is negligible vs Compute cost.
  • Cloud function needs in terms of computation are very low, so we are using the minimum resources possible. This is negligible vs Compute cost.
  • Secret manager usage is limited to cloud functions and VMs, which is negligible vs Compute cost.

The only cost which would not be negligible vs Compute cost is egress traffic, if your GitHub Action workflows generate a lot of it, for instance by pushing heavy artifacts outside of GCP. To represent the point at which this solution is cost efficient vs GitHub Action Hosted Runners, we can use the following inequation, where:

  • W: the number of minutes of CI per month
  • O: the number of minutes where VMs do nothing per month
  • Cgcp: the cost per minute of an N1 GCP VM ($0.0016), with performance equivalent to a GitHub Hosted Runner
  • Cgithub: the cost per minute of a GitHub Hosted Runner ($0.008)

The inequation is:

$$(W + O) \times C_{gcp} < W \times C_{github}$$

Which can be simplified, since $(C_{github} - C_{gcp}) / C_{gcp} = 0.0064 / 0.0016 = 4$, to:

$$O < 4W$$

The system is cost efficient if the time spent inactive by the runners is less than 4 times the time spent executing workflows.
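
As a worked example of this comparison, the Python sketch below assumes the monthly self hosted cost is (W + O) × Cgcp and the hosted cost is W × Cgithub, which is consistent with the factor of 4 stated above. The per-minute prices are the ones quoted in this section and the function names are hypothetical.

```python
C_GCP = 0.0016     # $ per minute for an N1 GCP VM, as quoted above
C_GITHUB = 0.008   # $ per minute for a GitHub Hosted Runner, as quoted above


def self_hosted_is_cheaper(work_minutes: float, idle_minutes: float) -> bool:
    """Compare the monthly cost of both options for the same CI workload."""
    self_hosted_cost = (work_minutes + idle_minutes) * C_GCP
    hosted_cost = work_minutes * C_GITHUB
    return self_hosted_cost < hosted_cost


def max_idle_minutes(work_minutes: float) -> float:
    """Idle time at which both options cost the same."""
    return work_minutes * (C_GITHUB - C_GCP) / C_GCP
```

For 1000 minutes of CI per month, the break-even point is about 4000 idle minutes: 3000 idle minutes keep the self hosted runners cheaper, 5000 do not.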

Requirement

  • Bash (automation)
  • Terraform and Packer (deployment of the infrastructure)
  • Docker (run the CI locally)
  • NodeJs (the cloud function environment)

Glossary

  • Runner: Usually a virtual machine used by the GitHub Action API to schedule workflow runs, through the GitHub Action software.
  • GitHub Action Hosted Runner: The runners provided by GitHub, free for open source projects and billed for private projects.
  • The system: Represents the solution as a whole proposed by this project (GCP project and GitHub App).
  • Component: Represents a dedicated entity in GCP responsible for doing one thing. Usually can be mapped to a Terraform child module.
  • Workflow: A GitHub Action automated process to host CI/CD logic.
  • Inactive Runner: A VM started and not executing any workflow. Also named non busy in some parts of the code.
  • Idle Runner: A Runner which cannot be scaled down because of the idle policy.

Cron syntax

All crons are evaluated in the time zone google.time_zone. We are using the following cron syntax:

┌───────────── minute (0 - 59)
│ ┌───────────── hour (0 - 23)
│ │ ┌───────────── day of the month (1 - 31)
│ │ │ ┌───────────── month (1 - 12 or JAN-DEC)
│ │ │ │ ┌───────────── day of the week (0 - 6 or SUN-SAT)
│ │ │ │ │                                   
│ │ │ │ │
│ │ │ │ │
* * * * *

Requirements

Name Version
terraform >= 0.13

Providers

Name Version
google n/a

Inputs

Name: github
Required: yes. Default: n/a.
Description: Represents the GitHub App installed in the GitHub organization where the runners from GCP will serve as self hosted runners. The GitHub App allows communication between the GitHub organization and the GCP project. Check GitHub setup.
  • organisation: The GitHub organization (the name, for instance fabernovel for FABERNOVEL) where runners will be available.
  • app_id: The id of the GitHub App. Available at https://github.com/organizations/{org}/settings/apps/{app}.
  • app_installation_id: The installation id of the GitHub App within the organization. Available at https://github.com/organizations/{org}/settings/installations, Configure. The app_installation_id is then in the url, as in https://github.com/organizations/faberNovel/settings/installations/{app_installation_id}.
  • client_id: The client id of the GitHub App. Manageable at https://github.com/organizations/{org}/settings/apps/{app}.
  • client_secret: The client secret of the GitHub App. Manageable at https://github.com/organizations/{org}/settings/apps/{app}.
  • key_pem_b64: The private key of the GitHub App. Manageable at https://github.com/organizations/{org}/settings/apps/{app}.
  • webhook_secret: The webhook secret of the GitHub App. Manageable at https://github.com/organizations/{org}/settings/apps/{app}.
Type:
object({
  organisation = string
  app_id = string
  app_installation_id = string
  client_id = string
  client_secret = string
  key_pem_b64 = string
  webhook_secret = string
})

Name: google
Required: yes. Default: n/a.
Description: Represents the GCP project hosting the virtual machines acting as GitHub Action self hosted runners. Check GCP setup.
  • project: The project ID of the GCP project.
  • region: The region of the GCP project.
  • zone: The zone of the GCP project.
  • credentials_json_b64: The content in b64 of the service account keys used by Terraform to manipulate the GCP project.
  • env: A label used to tag GCP resources. taint_labels uses this value to taint runner labels.
  • time_zone: The time zone to use in the project, as described by the TZ database. idle_schedule, down_schedule, healthcheck_schedule and renew_schedule are evaluated in this time zone.
Type:
object({
  project = string
  region = string
  zone = string
  credentials_json_b64 = string
  env = string
  time_zone = string
})

Name: runner
Required: yes. Default: n/a.
Description:
  • type: The machine type of the runners, for instance n1-standard-2.
  • taint_labels: Enable tainting runner labels, useful to avoid mixing debug and prod runners in your organization.
Type:
object({
  type = string
  taint_labels = bool
})

Name: scaling
Required: yes. Default: n/a.
Description:
  • idle_count: The number of runners to keep idle.
  • idle_schedule: A cron describing the idling period of runners. See Cron syntax.
  • up_rate: The number of runners to create when scaling up.
  • up_max: The maximum number of runners.
  • down_rate: The number of inactive runners to delete when scaling down.
  • down_schedule: A cron to regularly trigger scaling down. See Cron syntax.
Type:
object({
  idle_count = number
  idle_schedule = string
  up_rate = number
  up_max = number
  down_rate = number
  down_schedule = string
})

Name: triggers
Required: yes. Default: n/a.
Description:
  • healthcheck_schedule: A cron to trigger health checks. See Cron syntax.
  • renew_schedule: A cron to trigger runner renewal. See Cron syntax.
Type:
object({
  healthcheck_schedule = string
  renew_schedule = string
})

Outputs

Name Description
github_webhook_url n/a

Contributing

We love to hear your input, whether it is about reporting bugs, proposing fixes or new features, or general discussion about the system behavior. To do so we are using GitHub, so all changes must be done using Pull Requests. To report issues, use GitHub issues. Try to write bug reports with detail, background, and if applicable some code. Try to be specific and include steps to reproduce. The code style used by the project is transparent: to be sure you are respecting it, you can execute the CI locally using the run-all.sh script.

Similar projects

License

The project is licensed under the GNU General Public License v3.0.