Terraform module for scalable self hosted GitHub action runners

This Terraform module creates the required infrastructure needed to host GitHub Actions self-hosted, auto-scaling runners on AWS spot instances. It provides the required logic to handle the life cycle for scaling up and down using a set of AWS Lambda functions. Runners are scaled down to zero to avoid costs when no workflows are active.

📢 We maintain this project as a truly open-source project, on a best-effort basis. We welcome contributions from the community. Feel free to help us by answering issues, reviewing PRs, and maintaining and improving the project.

📢 v5 replaces Amazon Linux 2 with Amazon Linux 2023 as the default OS. Check the PR for more details and other changes.

📢 For contributions to older versions, you can open a PR against the related branch, e.g. v4. We have no release process in place for older versions.

📢 HELP WANTED: We have been running the AWS self-hosted GitHub runners OS project in Philips Labs for over two years! We are incredibly happy with all the feedback and contributions of the open-source community. In the coming months we will speak at some conferences to share the solution and the story of running this open-source project. Via this questionnaire we would like to gather feedback from the community to use in our talks.

Motivation

GitHub Actions self-hosted runners provide a flexible option to run CI workloads on the infrastructure of your choice. Currently, no option is provided to automate the creation and scaling of action runners. This module creates the AWS infrastructure to host action runners on spot instances. It provides lambda modules to orchestrate the life cycle of the action runners.

Lambda is chosen as the runtime for two major reasons. First, it allows the creation of small components with minimal access to AWS and GitHub. Secondly, it provides a scalable setup with minimal costs that works on repo level and scales to organization level. The lambdas will create Linux based EC2 instances with Docker to serve CI workloads that can run on Linux and/or Docker. The main goal is to support Docker-based workloads.

A logical question would be: why not Kubernetes? In the current approach, we stay close to how GitHub action runners are implemented today: the runner is installed on a host where the required software is available. Another logical choice would be AWS Auto Scaling groups. However, this choice would typically require many more permissions to GitHub at the instance level. Besides that, scaling up and down is not trivial.

Overview

The moment a GitHub action workflow requiring a self-hosted runner is triggered, GitHub will try to find a runner which can execute the workload. See additional notes for how the selection is made. This module reacts to GitHub's workflow_job event for the triggered workflow and creates a new runner if necessary.

For the webhook lambda to receive the workflow_job event, a webhook needs to be created in GitHub. The check_run option was dropped in version 2.x. The following options to send the event are supported.

  • Create a GitHub app, define a webhook and subscribe the app to the workflow_job event.
  • Create a webhook on enterprise, org or repo level and subscribe it to the workflow_job event.

In AWS an API gateway endpoint is created that is able to receive the GitHub webhook events via HTTP POST. The gateway triggers the webhook lambda, which verifies the signature of the event. This check guarantees the event is sent by the GitHub App. The lambda only handles workflow_job events with status queued that match the runner labels. The accepted events are posted on an SQS queue. Messages on this queue are delayed for a configurable number of seconds (default 30 seconds) to give the available runners time to pick up this build.

The "scale up runner" lambda listens to the SQS queue and picks up events. The lambda runs various checks to decide whether a new EC2 spot instance needs to be created. For example, the instance is not created if the build is already started by an existing runner, or the maximum number of runners is reached.

The Lambda first requests a JIT configuration or registration token from GitHub, which is needed later by the runner to register itself. This avoids the case that the EC2 instance, which later in the process will install the agent, needs administration permissions to register the runner. Next, the EC2 spot instance is created via the launch template. The launch template defines the specifications of the required instance and contains a user_data script. This script will install the required software and configure it. The registration token for the action runner is stored in the parameter store (SSM), from which the user data script will fetch it and delete it once it has been retrieved. Once the user data script is finished, the action runner should be online, and the workflow will start in seconds.

Scaling down the runners is at the moment brute-forced: at a configurable interval (in minutes), a lambda checks for every runner (instance) whether it is busy. If the runner is not busy, it is removed from GitHub and the instance is terminated in AWS. At the moment there seems to be no other option to scale down more smoothly.

Downloading the GitHub Action Runner distribution can occasionally be slow (more than 10 minutes). Therefore a lambda is introduced that synchronizes the action runner binary from GitHub to an S3 bucket. The EC2 instance will fetch the distribution from the S3 bucket instead of the internet.

Secrets and private keys are stored in SSM Parameter Store. These values are encrypted using either the default KMS key for SSM or a custom KMS key that you pass in.

Architecture

Permissions are managed in several places. Below are the most important ones. For details check the Terraform sources.

  • The GitHub App requires access to actions and to publish workflow_job events to the AWS webhook (API gateway).
  • The scale up lambda should have access to EC2 for creating and tagging instances.
  • The scale down lambda should have access to EC2 to terminate instances.

Besides these permissions, the lambdas also need permission to CloudWatch (for logging and scheduling), SSM and S3. For more details about the required permissions see the documentation of the IAM module which uses permission boundaries.

Major configuration options

To be able to support a number of use-cases the module has quite a lot of configuration options. We try to choose reasonable defaults. Several examples also show the main cases of how to configure the runners.

  • Org vs Repo level. You can configure the module to connect the runners in GitHub on an org level and share the runners in your org. Or set the runners on repo level and the module will install the runner to the repo. There can be multiple repos but runners are not shared between repos.
  • Multi-Runner module. This module allows you to create multiple runner configurations with a single webhook and single GitHub App to simplify deployment of different types of runners. Refer to the ReadMe for more information to understand the functionality.
  • Workflow job event. You can configure the webhook in GitHub to send workflow job events to the webhook. Workflow job events were introduced by GitHub in September 2021 and are designed to support scalable runners. We advise using the workflow job event when possible.
  • Linux vs Windows. You can configure the OS types linux and win. Linux will be used by default.
  • Re-use vs Ephemeral. By default runners are re-used until detected idle. Once idle they will be removed from the pool. To improve security we are introducing ephemeral runners. Those runners are only used for one job. Ephemeral runners only work in combination with the workflow job event. For ephemeral runners the lambda requests a JIT (just in time) configuration via the GitHub API to register the runner. JIT configuration is limited to ephemeral runners (and currently not supported by GHES). For non-ephemeral runners a registration token is always requested. In both cases the configuration is made available to the instance via the same SSM parameter. To disable JIT configuration for ephemeral runners set enable_jit_config to false. We also suggest using a prebuilt AMI to improve the start time of jobs for ephemeral runners.
  • GitHub Cloud vs GitHub Enterprise Server (GHES). The runners support GitHub Cloud as well as GitHub Enterprise Server. For GHES we rely on our community for support and testing. We have no possibility to test ourselves on GHES.
  • Spot vs on-demand. The runners use either the EC2 spot or on-demand life cycle. Runners are created via the AWS CreateFleet API: the module (scale up lambda) requests instances in one of the subnets and of one of the specified instance types.
  • ARM64 support via Graviton/Graviton2 instance-types. When using the default example or top-level module with instance_types that match a Graviton/Graviton2 (ARM64) architecture (e.g. a1, t4g or any 6th-gen g or gd type), you must also specify runner_architecture = "arm64". The sub-modules will then be automatically configured to provision with ARM64 AMIs and leverage GitHub's ARM64 action runner. See below for more details.

AWS SSM Parameters

The module uses the AWS Systems Manager Parameter Store to store configuration for the runners, as well as registration tokens and secrets for the Lambdas. Paths for the parameters can be configured via the variable ssm_paths. The location of the configuration parameters is retrieved by the runners via the instance tag ghr:ssm_config_path. The following default paths will be used. Tokens or JIT config stored in the token path are deleted by the instance after retrieval; data not deleted after a day is removed by an SSM housekeeper lambda.

| Path | Description |
| --- | --- |
| ssm_paths.root/var.prefix?/app/ | App secrets used by the Lambdas |
| ssm_paths.root/var.prefix?/runners/config/<name> | Configuration parameters used by the runner start script |
| ssm_paths.root/var.prefix?/runners/tokens/<ec2-instance-id> | Either JIT configuration (ephemeral runners) or registration tokens (non-ephemeral runners) generated by the control plane (scale-up lambda), and consumed by the start script on the runner to activate / register the runner. |

Available configuration parameters:

| Parameter name | Description |
| --- | --- |
| agent_mode | Indicates if the agent is running in ephemeral mode or not. |
| enable_cloudwatch | Configuration for the cloudwatch agent to stream logging. |
| run_as | The user used for running the GitHub action runner agent. |
| token_path | The path where tokens are stored. |

Usage

Examples are provided in the examples directory. Please ensure you have installed the following tools.

  • Terraform, or tfenv.
  • Bash shell or compatible
  • Docker (optional, to build lambdas without node).
  • AWS cli (optional)
  • Node and yarn (for lambda development).

The module supports two main scenarios for creating runners. Repository level runners are dedicated to a single repository; no other repository can use the runner. At the organization level you can use the runner(s) for all repositories within the organization. See GitHub self-hosted runner instructions for more information. Before starting the deployment you have to choose one option.

The setup consists of running Terraform to create all AWS resources and manually configuring the GitHub App. The Terraform module requires configuration from the GitHub App and the GitHub app requires output from Terraform. Therefore you first create the GitHub App and configure the basics, then run Terraform, and afterwards finalize the configuration of the GitHub App.

Setup GitHub App (part 1)

Go to GitHub and create a new app. Be aware you can create apps for your organization or for a user. For now we only support organization level apps.

  1. Create an app in GitHub
  2. Choose a name
  3. Choose a website (mandatory, not required for the module).
  4. Disable the webhook for now (we will configure this later or create an alternative webhook).
  5. Permissions for all runners:
    • Repository:
      • Actions: Read-only (check for queued jobs)
      • Checks: Read-only (receive events for new builds)
      • Metadata: Read-only (default/required)
  6. Permissions for repo level runners only:
    • Repository:
      • Administration: Read & write (to register runner)
  7. Permissions for organization level runners only:
    • Organization
      • Self-hosted runners: Read & write (to register runner)
  8. Save the new app.
  9. On the General page, make a note of the "App ID" and "Client ID" parameters.
  10. Generate a new private key and save the app.private-key.pem file.

Setup terraform module

Download lambdas

To apply the terraform module, the compiled lambdas (.zip files) need to be available either locally or in an S3 bucket. They can either be downloaded from the GitHub release page or built locally.

To read the files from S3, set the lambda_s3_bucket variable and the specific object key for each lambda.
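
As a rough sketch, reading the lambdas from S3 could look like the fragment below. The bucket name and object keys are placeholders. The per-lambda key variable names are assumptions modelled on ami_housekeeper_lambda_s3_key from the inputs table, so verify them against the module's inputs before use.

```hcl
# Fragment of the module block; other inputs omitted.

# Bucket that holds the compiled lambda zip files (placeholder name).
lambda_s3_bucket = "my-lambda-artifacts"

# Assumed variable names and placeholder object keys, one per lambda.
webhook_lambda_s3_key = "lambdas/webhook.zip"
runners_lambda_s3_key = "lambdas/runners.zip"
syncer_lambda_s3_key  = "lambdas/runner-binaries-syncer.zip"
```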

The lambdas can be downloaded manually from the release page or using the download-lambda terraform module (requires curl to be installed on your machine). In the download-lambda directory, run terraform init && terraform apply. The lambdas will be saved to the same directory.

For local development you can build all the lambdas at once using .ci/build.sh or individually using yarn dist.

Service-linked role

To create spot instances the AWSServiceRoleForEC2Spot role needs to be added to your account. You can do that manually by following the AWS docs. To use terraform for creating the role, either add the following resource or let the module manage the service linked role by setting create_service_linked_role_spot to true. Be aware this is an account-wide role, so you may not want to manage it via a specific deployment.

resource "aws_iam_service_linked_role" "spot" {
  aws_service_name = "spot.amazonaws.com"
}
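
The alternative mentioned above, letting the module manage the role, is a single input. A minimal sketch of that fragment:

```hcl
# Fragment of the module block; other inputs omitted.

# Let the module create AWSServiceRoleForEC2Spot. The role is account-wide,
# so enable this in at most one deployment per account.
create_service_linked_role_spot = true
```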

Terraform module

Next create a second terraform workspace and initiate the module, or adapt one of the examples.

Note that github_app.key_base64 needs to be a base64-encoded string of the .pem file i.e. the output of base64 app.private-key.pem. The decoded string can either be a multiline value or a single line value with new lines represented with literal \n characters.

module "github-runner" {
  source  = "philips-labs/github-runner/aws"
  version = "REPLACE_WITH_VERSION"

  aws_region = "eu-west-1"
  vpc_id     = "vpc-123"
  subnet_ids = ["subnet-123", "subnet-456"]

  prefix = "gh-ci"

  github_app = {
    key_base64     = "base64string"
    id             = "1"
    webhook_secret = "webhook_secret"
  }

  webhook_lambda_zip                = "lambdas-download/webhook.zip"
  runner_binaries_syncer_lambda_zip = "lambdas-download/runner-binaries-syncer.zip"
  runners_lambda_zip                = "lambdas-download/runners.zip"
  enable_organization_runners = true
}

Run terraform by using the following commands

terraform init
terraform apply

The terraform output displays the API gateway url (endpoint) and secret, which you need in the next step.
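
Outputs are declared in your own root configuration. The sketch below assumes that the module exposes a webhook output object with an endpoint attribute, and that the webhook secret is generated in the root configuration with a random_id resource. Both are assumptions to check against the module's outputs and the examples; the resource and output names are hypothetical.

```hcl
# Hypothetical: generate the webhook secret in the root configuration and
# pass random_id.webhook_secret.hex to github_app.webhook_secret.
resource "random_id" "webhook_secret" {
  byte_length = 20
}

# Assumed output shape of the module: webhook.endpoint
output "webhook_endpoint" {
  value = module.github-runner.webhook.endpoint
}

output "webhook_secret" {
  value     = random_id.webhook_secret.hex
  sensitive = true
}
```

With outputs named like this, terraform output -raw webhook_secret prints the secret needed when configuring the webhook.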

The lambda for syncing the GitHub distribution to S3 is triggered via CloudWatch (by default once per hour). After deployment the function is triggered via S3 to ensure the distribution is cached.

Setup the webhook / GitHub App (part 2)

At this point you have two options. Either create a separate webhook (enterprise, org, or repo), or create a webhook in the App.

Option 1: Webhook

  1. Create a new webhook at the repo level for repo level runners, or org (or enterprise level) for org level runners.
  2. Provide the webhook url, which should be part of the output of terraform.
  3. Provide the webhook secret (terraform output -raw <NAME_OUTPUT_VAR>).
  4. Ensure the content type is application/json.
  5. In the "Permissions & Events" section and then "Subscribe to Events" subsection, check "Workflow Job". The module only handles workflow_job events; check_run is no longer supported.
  6. In the "Install App" section, install the App in your organization, either in all or in selected repositories.

Option 2: App

Go back to the GitHub App and update the following settings.

  1. Enable the webhook.
  2. Provide the webhook url, which should be part of the output of terraform.
  3. Provide the webhook secret (terraform output -raw <NAME_OUTPUT_VAR>).
  4. In the "Permissions & Events" section and then "Subscribe to Events" subsection, check "Workflow Job". The module only handles workflow_job events; check_run is no longer supported.

Install app

Finally you need to ensure the app is installed to all or selected repositories.

Go back to the GitHub App and update the following settings.

  1. In the "Install App" section, install the App in your organization, either in all or in selected repositories.

Encryption

The module supports two scenarios to manage environment secrets and private keys of the Lambda functions.

Encrypted via a module managed KMS key (default)

This is the default, no additional configuration is required.

Encrypted via a provided KMS key

You have to create and configure your own KMS key. The module will use the context with key: Environment and value var.environment as encryption context.

resource "aws_kms_key" "github" {
  is_enabled = true
}

module "runners" {
  # ...
  kms_key_arn = aws_kms_key.github.arn
  # ...
}

Pool

The module supports two options for keeping a pool of runners. One is a pool, which only supports org-level runners; the other is keeping runners idle.

The pool was introduced in combination with the ephemeral runners. It is primarily meant to ensure that a job can still be picked up if an event is unexpectedly dropped and no runner was created. The pool is maintained by a lambda. Each time the lambda is triggered, it checks whether the number of idle runners managed by the module meets the expected pool size. If not, the pool will be adjusted. Keep in mind that the scale down function is still active and will terminate instances that are detected as idle.

pool_runner_owner = "my-org"                  # Org to which the runners are added
pool_config = [{
  size                = 20                    # size of the pool
  schedule_expression = "cron(* * * * ? *)"   # cron expression to trigger the adjustment of the pool
}]

The pool is NOT enabled by default. It can be enabled by adding at least one object to the pool config list. The ephemeral example contains configuration options (commented out).

Idle runners

The module will scale down to zero runners by default. By specifying an idle_config, idle runners can be kept active. The scale down lambda checks if any of the cron expressions matches the current time with a margin of 5 seconds. When there is a match, the number of runners specified in the idle config will be kept active. In case multiple cron expressions match, only the first one is taken into account. Below is an idle configuration for keeping runners active from 9:00am to 5:59pm on working days. The cron expression generator by Cronhub is a great resource to set up your idle config.

By default, the oldest instances are evicted. This helps keep your environment up-to-date and reduce problems like running out of disk space or RAM. Alternatively, if your older instances have a long-living cache, you can override the evictionStrategy to newest_first to evict the newest instances first instead.

idle_config = [{
   cron             = "* * 9-17 * * 1-5"
   timeZone         = "Europe/Amsterdam"
   idleCount        = 2
   # Defaults to 'oldest_first'
   evictionStrategy = "oldest_first"
}]

Note: When using Windows runners it's recommended to keep a few runners warmed up due to the minutes-long cold start time.

Supported config

Cron expressions are parsed by cron-parser. The supported syntax is:

*    *    *    *    *    *
┬    ┬    ┬    ┬    ┬    ┬
│    │    │    │    │    │
│    │    │    │    │    └ day of week (0 - 7) (0 or 7 is Sun)
│    │    │    │    └───── month (1 - 12)
│    │    │    └────────── day of month (1 - 31)
│    │    └─────────────── hour (0 - 23)
│    └──────────────────── minute (0 - 59)
└───────────────────────── second (0 - 59, optional)

For time zones, please check the TZ database name column for the supported values.

Ephemeral runners

You can configure runners to be ephemeral, in which case each runner is used for only one job. The feature should be used in conjunction with listening for the workflow job event. Please consider the following:

  • The scale down lambda is still active, and should only remove orphan instances. But there is no strict check in place. So ensure you configure the minimum_running_time_in_minutes to a value that is high enough to get your runner booted and connected to avoid it being terminated before executing a job.
  • The messages sent from the webhook lambda to the scale-up lambda are by default delayed by SQS, to give available runners a chance to start the job before the decision is made to scale more runners. For ephemeral runners there is no need to wait. Set delay_webhook_event to 0.
  • All events in the queue will lead to a new runner created by the lambda. By setting enable_job_queued_check to true you can enforce a rule of only creating a runner if the event has a correlated queued job. Setting this can avoid creating useless runners, for example when jobs got cancelled before a runner was created or if the job was already picked up by another runner. We suggest using this in combination with a pool.
  • To ensure runners are created in the same order GitHub sends the events, by default we use a FIFO queue. This is mainly relevant for repo level runners. For ephemeral runners you can set enable_fifo_build_queue to false.
  • Errors related to scaling should be retried via SQS. You can configure job_queue_retention_in_seconds and redrive_build_queue to tune the behavior. We have no mechanism to avoid events never being processed, which means potentially no runner gets created and the job in GitHub times out in 6 hours.
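
Taken together, the considerations above could translate into a fragment like the following. This is a sketch rather than a recommended configuration. enable_ephemeral_runners is assumed to be the switch that turns on ephemeral mode (it is not named in this section), and the value for minimum_running_time_in_minutes is an arbitrary placeholder.

```hcl
# Fragment of the module block; other inputs omitted.

# Assumed variable name for enabling ephemeral (single-job) runners.
enable_ephemeral_runners = true

# Ephemeral runners do not need to wait for idle runners to pick up the job.
delay_webhook_event = 0

# Only create a runner when the event still has a correlated queued job.
enable_job_queued_check = true

# Event ordering is mainly relevant for repo level runners.
enable_fifo_build_queue = false

# Placeholder: long enough for an instance to boot and register.
minimum_running_time_in_minutes = 10
```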

The example for ephemeral runners is based on the default example. Have a look at the diff to see the major configuration differences.

Prebuilt Images

This module also allows you to run agents from a prebuilt AMI to gain faster startup times. The module provides several examples to build your own custom AMI. To remove old images, an AMI housekeeper module can be used. You can find more information in the image README.md for building custom images.

Experimental - Optional queue to publish GitHub workflow job events

This queue is an experimental feature that allows you to receive a copy of the workflow_job events sent by the GitHub App, for example to calculate a matrix or monitor the system.

To enable the feature set enable_workflow_job_events_queue = true. Be aware the feature is experimental!

Messages received on the queue are using the same format as published by GitHub wrapped in a property workflowJobEvent.

export interface GithubWorkflowEvent {
  workflowJobEvent: WorkflowJobEvent;
}

This extensible format allows more fields to be added if needed. You can configure the queue by setting properties in workflow_job_events_queue_config.

NOTE: By default, a runner AMI update requires a re-apply of this terraform config (the runner AMI ID is looked up by a terraform data source). To avoid this, you can use ami_id_ssm_parameter_name to have the scale-up lambda dynamically look up the runner AMI ID from an SSM parameter at instance launch time. This SSM parameter is managed outside of this module (e.g. by a runner AMI build workflow).
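
A minimal sketch of the input described in the note. The parameter name is a placeholder for a parameter that you maintain yourself outside this module. Per the inputs table, it should be of data type aws:ec2:image, and it overrides ami_filter.

```hcl
# Fragment of the module block; other inputs omitted.

# Placeholder name of an externally managed SSM parameter holding the AMI ID.
# The scale-up lambda resolves it at instance launch time.
ami_id_ssm_parameter_name = "/github-runners/runner-ami-id"
```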

Examples

Examples are located in the examples directory. The following examples are provided:

  • Default: The default example of the module
  • ARM64: Example usage with ARM64 architecture
  • Ephemeral: Example usages of ephemeral runners based on the default example.
  • Multi Runner: Example usage of creating a multi runner which creates multiple runners/configurations with a single deployment
  • Permissions boundary: Example usages of permissions boundaries.
  • Prebuilt Images: Example usages of deploying runners with a custom prebuilt image.
  • Ubuntu: Example usage of creating a runner using Ubuntu AMIs.
  • Windows: Example usage of creating a runner using Windows as the OS.

Sub modules

The module contains several submodules. You can use the module via the main module or assemble your own setup by initializing the submodules yourself.

The following submodules are the core of the module and are mandatory:

The following sub modules are optional and are provided as examples or utilities:

ARM64 configuration for submodules. When using the top-level module, configure runner_architecture = "arm64" and ensure the list of instance_types matches. When not using the top-level module, ensure these properties are set on the submodules.
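
For the top-level module, an ARM64 setup could be sketched as follows. The instance types are illustrative Graviton2 choices, not a recommendation.

```hcl
# Fragment of the module block; other inputs omitted.

# Provision ARM64 AMIs and the ARM64 action runner.
runner_architecture = "arm64"

# Must be ARM64 (Graviton) instance types to match the architecture above.
instance_types = ["t4g.large", "c6g.large"]
```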

Logging

The module uses AWS Lambda Powertools for logging. By default the log level is set to info. When the log level is set to debug, the incoming events of the Lambda are logged as well.

Log messages contain at least the following keys:

  • message: The logged message
  • environment: The environment prefix provided via Terraform
  • service: The lambda
  • module: The TypeScript module writing the log message
  • function-name: The name of the lambda function (prefix + function name)
  • github: Depending on the lambda, contains GitHub context
  • runner: Depending on the lambda, specific context related to the runner

An example log message of the scale-up function:

{
    "level": "INFO",
    "message": "Received event",
    "service": "runners-scale-up",
    "timestamp": "2023-03-20T08:15:27.448Z",
    "xray_trace_id": "1-6418161e-08825c2f575213ef760531bf",
    "module": "scale-up",
    "region": "eu-west-1",
    "environment": "my-linux-x64",
    "aws-request-id": "eef1efb7-4c07-555f-9a67-b3255448ee60",
    "function-name": "my-linux-x64-scale-up",
    "runner": {
        "type": "Repo",
        "owner": "test-runners/multi-runner"
    },
    "github": {
        "event": "workflow_job",
        "workflow_job_id": "1234"
    }
}
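
The log level is raised through a module input. A minimal sketch, assuming the input is named log_level (the name is not given in this section):

```hcl
# Fragment of the module block; other inputs omitted.

# Assumed input name. With "debug" the incoming Lambda events are logged too.
log_level = "debug"
```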

Tracing

Because of the distributed architecture of this application, it can be difficult to troubleshoot. We support the option to enable tracing for all the lambda functions created by this application. To enable tracing, provide the tracing_config option to the root module or the inner modules.

This tracing config generates timelines for the following events:

  • Basic lifecycle of lambda function
  • Traces for GitHub API calls (can be configured by capture_http_requests).
  • Traces for all AWS SDK calls

This feature is disabled by default.
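
A minimal sketch of enabling tracing. Only tracing_config and capture_http_requests are named in this section. The mode attribute and its value are assumptions based on the tracing modes AWS Lambda offers ("Active" and "PassThrough"), so check the module's inputs for the exact object shape.

```hcl
# Fragment of the module block; other inputs omitted.
tracing_config = {
  mode                  = "Active" # assumed attribute name and value
  capture_http_requests = true     # also trace GitHub API calls
}
```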

Debugging

In case the setup does not work as intended follow the trace of events:

  • In the GitHub App configuration, the Advanced page displays all webhook events that were sent.
  • In AWS CloudWatch, every lambda has a log group. Look at the logs of the webhook and scale-up lambdas.
  • In AWS SQS you can see messages available or in flight.
  • Once an EC2 instance is running, you can connect to it in the EC2 user interface using Session Manager (use enable_ssm_on_runners = true). Check the user data script using cat /var/log/user-data.log. By default several log files of the instances are streamed to AWS CloudWatch, look for a log group named <environment>/runners. In the log group you should see at least the log streams for the user data installation and runner agent.
  • Registered instances should show up in the Settings - Actions page of the repository or organization (depending on the installation mode).

Security Considerations

This module creates resources in your AWS infrastructure, and EC2 instances for hosting the self-hosted runners on-demand. IAM permissions are set to a minimal level, and could be further limited by using permission boundaries. Instance permissions are limited to retrieving and deleting the registration token, accessing the instance's own tags, and terminating the instance itself. By nature instances are short-lived. We strongly suggest using ephemeral runners to ensure a safe build environment for each workflow job execution.

Ephemeral runners use the JIT configuration, a configuration that can only be used once to activate a runner. For non-ephemeral runners this option is not provided by GitHub. For non-ephemeral runners a registration token is passed via SSM. The token is deleted from SSM after use, but it remains valid and is potentially available in memory on the runner. For ephemeral runners this problem is avoided by using just-in-time tokens.

The examples use standard AMIs for different operating systems. Instances are not hardened, and sudo operations are not blocked. To provide an out-of-the-box working experience, the module installs and configures the runner by default. Secrets are not hard-coded, but they do end up in the memory of the instances. You can harden the instance by providing your own AMI and overwriting the cloud-init script.

We welcome any improvement to the standard module to make the default as secure as possible. In the end it remains your responsibility to keep your environment secure.

Requirements

| Name | Version |
| --- | --- |
| terraform | >= 1.3.0 |
| aws | ~> 5.2 |
| random | ~> 3.0 |

Providers

| Name | Version |
| --- | --- |
| aws | ~> 5.2 |
| random | ~> 3.0 |

Modules

| Name | Source | Version |
| --- | --- | --- |
| ami_housekeeper | ./modules/ami-housekeeper | n/a |
| runner_binaries | ./modules/runner-binaries-syncer | n/a |
| runners | ./modules/runners | n/a |
| ssm | ./modules/ssm | n/a |
| webhook | ./modules/webhook | n/a |

Resources

| Name | Type |
| --- | --- |
| aws_sqs_queue.queued_builds | resource |
| aws_sqs_queue.queued_builds_dlq | resource |
| aws_sqs_queue.webhook_events_workflow_job_queue | resource |
| aws_sqs_queue_policy.build_queue_dlq_policy | resource |
| aws_sqs_queue_policy.build_queue_policy | resource |
| aws_sqs_queue_policy.webhook_events_workflow_job_queue_policy | resource |
| random_string.random | resource |
| aws_iam_policy_document.deny_unsecure_transport | data source |

Inputs

Name Description Type Default Required
ami_filter Map of lists used to create the AMI filter for the action runner AMI. map(list(string))
{
"state": [
"available"
]
}
no
ami_housekeeper_cleanup_config Configuration for AMI cleanup.

amiFilters - Filters to use when searching for AMIs to cleanup. Default filter for images owned by the account and that are available.
dryRun - If true, no AMIs will be deregistered. Default false.
launchTemplateNames - Launch template names to use when searching for AMIs to cleanup. Default no launch templates.
maxItems - The maximum numer of AMI's tha will be queried for cleanup. Default no maximum.
minimumDaysOld - Minimum number of days old an AMI must be to be considered for cleanup. Default 30.
ssmParameterNames - SSM parameter names to use when searching for AMIs to cleanup. This parameter should be set when using SSM to configure the AMI to use. Default no SSM parameters.
object({
amiFilters = optional(list(object({
Name = string
Values = list(string)
})),
[{
Name : "state",
Values : ["available"],
},
{
Name : "image-type",
Values : ["machine"],
}]
)
dryRun = optional(bool, false)
launchTemplateNames = optional(list(string))
maxItems = optional(number)
minimumDaysOld = optional(number, 30)
ssmParameterNames = optional(list(string))
})
{} no
ami_housekeeper_lambda_s3_key S3 key for AMI housekeeper lambda function. Required if using S3 bucket to specify lambdas. string null no
ami_housekeeper_lambda_s3_object_version S3 object version for AMI housekeeper lambda function. Useful if S3 versioning is enabled on source bucket. string null no
ami_housekeeper_lambda_schedule_expression Scheduler expression for the AMI housekeeper lambda. string "rate(1 day)" no
ami_housekeeper_lambda_timeout Time out of the lambda in seconds. number 300 no
ami_housekeeper_lambda_zip File location of the lambda zip file. string null no
ami_id_ssm_parameter_name Externally managed SSM parameter (of data type aws:ec2:image) that contains the AMI ID to launch runner instances from. Overrides ami_filter string null no
ami_kms_key_arn Optional CMK Key ARN to be used to launch an instance from a shared encrypted AMI string null no
ami_owners The list of owners used to select the AMI of action runner instances. list(string)
[
"amazon"
]
no
associate_public_ipv4_address Associate public IPv4 with the runner. Only tested with IPv4 bool false no
aws_partition (optional) partition in the arn namespace to use if not 'aws' string "aws" no
aws_region AWS region. string n/a yes
block_device_mappings The EC2 instance block device configuration. Takes the following keys: device_name, delete_on_termination, volume_type, volume_size, encrypted, iops, throughput, kms_key_id, snapshot_id.
list(object({
delete_on_termination = optional(bool, true)
device_name = optional(string, "/dev/xvda")
encrypted = optional(bool, true)
iops = optional(number)
kms_key_id = optional(string)
snapshot_id = optional(string)
throughput = optional(number)
volume_size = number
volume_type = optional(string, "gp3")
}))
[
{
"volume_size": 30
}
]
no
cloudwatch_config (optional) Replaces the module's default cloudwatch log config. See https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html for details. string null no
create_service_linked_role_spot (optional) create the service linked role for spot instances that is required by the scale-up lambda. bool false no
delay_webhook_event The number of seconds the event accepted by the webhook is invisible on the queue before the scale up lambda will receive the event. number 30 no
disable_runner_autoupdate Disable the auto update of the github runner agent. Be aware there is a grace period of 30 days, see also the GitHub article bool false no
enable_ami_housekeeper Option to enable the lambda to clean up old AMIs. bool false no
enable_cloudwatch_agent Enables the cloudwatch agent on the ec2 runner instances. The runner uses a default config that can be overridden via cloudwatch_config. bool true no
enable_ephemeral_runners Enable ephemeral runners, runners will only be used once. bool false no
enable_event_rule_binaries_syncer Option to disable EventBridge Lambda trigger for the binary syncer, useful to stop automatic updates of binary distribution. bool true no
enable_fifo_build_queue Enable a FIFO queue to keep the order of events received by the webhook. Recommended for repo level runners. bool false no
enable_jit_config Overwrite the default behavior for JIT configuration. By default JIT configuration is enabled for ephemeral runners and disabled for non-ephemeral runners. In case of GHES, check first if the JIT config API is available. In case you are upgrading from 3.x to 4.x you can set enable_jit_config to false to avoid a breaking change when having your own AMI. bool null no
enable_job_queued_check Only scale if the job event received by the scale up lambda is in the queued state. By default enabled for non ephemeral runners and disabled for ephemeral. Set this variable to overwrite the default behavior. bool null no
enable_managed_runner_security_group Enables creation of the default managed security group. Unmanaged security groups can be specified via runner_additional_security_group_ids. bool true no
enable_organization_runners Register runners to organization, instead of repo level bool false no
enable_runner_binaries_syncer Option to disable the lambda to sync GitHub runner distribution, useful when using a pre-built AMI. bool true no
enable_runner_detailed_monitoring Should detailed monitoring be enabled for the runner. Set this to true if you want to use detailed monitoring. See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-cloudwatch-new.html for details. bool false no
enable_runner_workflow_job_labels_check_all If set to true all labels in the workflow job must match the GitHub labels (os, architecture and self-hosted). When false if any label matches it will trigger the webhook. bool true no
enable_ssm_on_runners Enable to allow access to the runner instances for debugging purposes via SSM. Note that this adds additional permissions to the runner instances. bool false no
enable_user_data_debug_logging_runner Option to enable debug logging for user-data, this logs all secrets as well. bool false no
enable_userdata Should the userdata script be enabled for the runner. Set this to false if you are using your own prebuilt AMI. bool true no
enable_workflow_job_events_queue Enabling this experimental feature will create a secondary SQS queue to which a copy of the workflow_job event will be delivered. bool false no
ghes_ssl_verify GitHub Enterprise SSL verification. Set to 'false' when custom certificate (chains) is used for GitHub Enterprise Server (insecure). bool true no
ghes_url GitHub Enterprise Server URL. Example: https://github.internal.co - DO NOT SET IF USING PUBLIC GITHUB string null no
github_app GitHub app parameters, see your github app. Ensure the key is the base64-encoded .pem file (the output of base64 app.private-key.pem, not the content of private-key.pem).
object({
key_base64 = string
id = string
webhook_secret = string
})
n/a yes
idle_config List of time periods, defined as a cron expression, to keep a minimum amount of runners active instead of scaling down to 0. By defining this list you can ensure that in time periods that match the cron expression within 5 seconds a runner is kept idle.
list(object({
cron = string
timeZone = string
idleCount = number
evictionStrategy = optional(string, "oldest_first")
}))
[] no
instance_allocation_strategy The allocation strategy for spot instances. AWS recommends using price-capacity-optimized however the AWS default is lowest-price. string "lowest-price" no
instance_max_spot_price Max price for spot instances per hour. This variable will be passed to the create fleet as max spot price for the fleet. string null no
instance_profile_path The path that will be added to the instance_profile, if not set the environment name will be used. string null no
instance_target_capacity_type Default lifecycle used for runner instances, can be either spot or on-demand. string "spot" no
instance_types List of instance types for the action runner. Defaults are based on runner_os (al2023 for linux and Windows Server Core for win). list(string)
[
"m5.large",
"c5.large"
]
no
job_queue_retention_in_seconds The number of seconds the job is held in the queue before it is purged. number 86400 no
key_name Key pair name string null no
kms_key_arn Optional CMK Key ARN to be used for Parameter Store. This key must be in the current account. string null no
lambda_architecture AWS Lambda architecture. Lambda functions using Graviton processors ('arm64') tend to have better price/performance than 'x86_64' functions. string "arm64" no
lambda_principals (Optional) add extra principals to the role created for execution of the lambda, e.g. for local testing.
list(object({
type = string
identifiers = list(string)
}))
[] no
lambda_runtime AWS Lambda runtime. string "nodejs18.x" no
lambda_s3_bucket S3 bucket from which to specify lambda functions. This is an alternative to providing local files directly. string null no
lambda_security_group_ids List of security group IDs associated with the Lambda function. list(string) [] no
lambda_subnet_ids List of subnets in which the action runners will be launched; the subnets need to be subnets in the vpc_id. list(string) [] no
lambda_tracing_mode DEPRECATED: Replaced by tracing_config. string null no
log_level Logging level for lambda logging. Valid values are 'silly', 'trace', 'debug', 'info', 'warn', 'error', 'fatal'. string "info" no
logging_kms_key_id Specifies the kms key id to encrypt the logs with. string null no
logging_retention_in_days Specifies the number of days you want to retain log events for the lambda log group. Possible values are: 0, 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, and 3653. number 180 no
minimum_running_time_in_minutes The time an ec2 action runner should be running at minimum before terminated, if not busy. number null no
pool_config The configuration for updating the pool. The pool_size to adjust to by the events triggered by the schedule_expression. For example you can configure a cron expression for weekdays to adjust the pool to 10 and another expression for the weekend to adjust the pool to 1.
list(object({
schedule_expression = string
size = number
}))
[] no
pool_lambda_reserved_concurrent_executions Amount of reserved concurrent executions for the scale-up lambda function. A value of 0 disables lambda from being triggered and -1 removes any concurrency limitations. number 1 no
pool_lambda_timeout Time out for the pool lambda in seconds. number 60 no
pool_runner_owner The pool will deploy runners to the GitHub org ID, set this value to the org to which you want the runners deployed. Repo level is not supported. string null no
prefix The prefix used for naming resources string "github-actions" no
queue_encryption Configure how data on queues managed by the module is encrypted at rest. Options are encrypted via SSE, not encrypted, and encrypted via KMS. By default encryption via SSE is enabled. For more details see the Terraform aws_sqs_queue resource https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/sqs_queue.
object({
kms_data_key_reuse_period_seconds = number
kms_master_key_id = string
sqs_managed_sse_enabled = bool
})
{
"kms_data_key_reuse_period_seconds": null,
"kms_master_key_id": null,
"sqs_managed_sse_enabled": true
}
no
redrive_build_queue Set options to attach (optional) a dead letter queue to the build queue, the queue between the webhook and the scale up lambda. You have the following options. 1. Disable by setting enabled to false. 2. Enable by setting enabled to true, maxReceiveCount to a number of max retries.
object({
enabled = bool
maxReceiveCount = number
})
{
"enabled": false,
"maxReceiveCount": null
}
no
repository_white_list List of github repository full names (owner/repo_name) that will be allowed to use the github app. Leave empty for no filtering. list(string) [] no
role_path The path that will be added to role path for created roles, if not set the environment name will be used. string null no
role_permissions_boundary Permissions boundary that will be added to the created roles. string null no
runner_additional_security_group_ids (optional) List of additional security groups IDs to apply to the runner. list(string) [] no
runner_architecture The platform architecture of the runner instance_type. string "x64" no
runner_as_root Run the action runner under the root user. Variable runner_run_as will be ignored. bool false no
runner_binaries_s3_logging_bucket Bucket for action runner distribution bucket access logging. string null no
runner_binaries_s3_logging_bucket_prefix Bucket prefix for action runner distribution bucket access logging. string null no
runner_binaries_s3_sse_configuration Map containing server-side encryption configuration for runner-binaries S3 bucket. any
{
"rule": {
"apply_server_side_encryption_by_default": {
"sse_algorithm": "AES256"
}
}
}
no
runner_binaries_s3_versioning Status of S3 versioning for runner-binaries S3 bucket. Once set to Enabled the change cannot be reverted via Terraform! string "Disabled" no
runner_binaries_syncer_lambda_timeout Time out of the binaries sync lambda in seconds. number 300 no
runner_binaries_syncer_lambda_zip File location of the binaries sync lambda zip file. string null no
runner_boot_time_in_minutes The minimum time for an EC2 runner to boot and register as a runner. number 5 no
runner_credit_specification The credit option for CPU usage of a T instance. Can be unset, "standard" or "unlimited". string null no
runner_ec2_tags Map of tags that will be added to the launch template instance tag specifications. map(string) {} no
runner_egress_rules List of egress rules for the GitHub runner instances.
list(object({
cidr_blocks = list(string)
ipv6_cidr_blocks = list(string)
prefix_list_ids = list(string)
from_port = number
protocol = string
security_groups = list(string)
self = bool
to_port = number
description = string
}))
[
{
"cidr_blocks": [
"0.0.0.0/0"
],
"description": null,
"from_port": 0,
"ipv6_cidr_blocks": [
"::/0"
],
"prefix_list_ids": null,
"protocol": "-1",
"security_groups": null,
"self": null,
"to_port": 0
}
]
no
runner_extra_labels Extra (custom) labels for the runners (GitHub). Labels checks on the webhook can be enforced by setting enable_workflow_job_labels_check. GitHub read-only labels should not be provided. list(string) [] no
runner_group_name Name of the runner group. string "Default" no
runner_iam_role_managed_policy_arns Attach AWS or customer-managed IAM policies (by ARN) to the runner IAM role list(string) [] no
runner_log_files (optional) Replaces the module default cloudwatch log config. See https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html for details.
list(object({
log_group_name = string
prefix_log_group = bool
file_path = string
log_stream_name = string
}))
null no
runner_metadata_options Metadata options for the ec2 runner instances. By default, the module uses metadata tags for bootstrapping the runner, only disable instance_metadata_tags when using custom scripts for starting the runner. map(any)
{
"http_endpoint": "enabled",
"http_put_response_hop_limit": 1,
"http_tokens": "required",
"instance_metadata_tags": "enabled"
}
no
runner_name_prefix The prefix used for the GitHub runner name. The prefix will be used in the default start script to prefix the instance name when registering the runner in GitHub. The value is available via an EC2 tag 'ghr:runner_name_prefix'. string "" no
runner_os The EC2 Operating System type to use for action runner instances (linux,windows). string "linux" no
runner_run_as Run the GitHub actions agent as user. string "ec2-user" no
runners_lambda_s3_key S3 key for runners lambda function. Required if using S3 bucket to specify lambdas. string null no
runners_lambda_s3_object_version S3 object version for runners lambda function. Useful if S3 versioning is enabled on source bucket. string null no
runners_lambda_zip File location of the lambda zip file for scaling runners. string null no
runners_maximum_count The maximum number of runners that will be created. number 3 no
runners_scale_down_lambda_timeout Time out for the scale down lambda in seconds. number 60 no
runners_scale_up_lambda_timeout Time out for the scale up lambda in seconds. number 30 no
runners_ssm_housekeeper Configuration for the SSM housekeeper lambda. This lambda deletes token / JIT config from SSM.

schedule_expression: is used to configure the schedule for the lambda.
enabled: enable or disable the lambda trigger via the EventBridge.
lambda_timeout: timeout for the lambda in seconds.
config: configuration for the lambda function. Token path will be read by default from the module.
object({
schedule_expression = optional(string, "rate(1 day)")
enabled = optional(bool, true)
lambda_timeout = optional(number, 60)
config = object({
tokenPath = optional(string)
minimumDaysOld = optional(number, 1)
dryRun = optional(bool, false)
})
})
{
"config": {}
}
no
scale_down_schedule_expression Scheduler expression to check every x for scale down. string "cron(*/5 * * * ? *)" no
scale_up_reserved_concurrent_executions Amount of reserved concurrent executions for the scale-up lambda function. A value of 0 disables lambda from being triggered and -1 removes any concurrency limitations. number 1 no
ssm_paths The root path used in SSM to store configuration and secrets.
object({
root = optional(string, "github-action-runners")
app = optional(string, "app")
runners = optional(string, "runners")
use_prefix = optional(bool, true)
})
{} no
subnet_ids List of subnets in which the action runner instances will be launched. The subnets need to exist in the configured VPC (vpc_id), and must reside in different availability zones (see philips-labs#2904) list(string) n/a yes
syncer_lambda_s3_key S3 key for syncer lambda function. Required if using an S3 bucket to specify lambdas. string null no
syncer_lambda_s3_object_version S3 object version for syncer lambda function. Useful if S3 versioning is enabled on source bucket. string null no
tags Map of tags that will be added to created resources. By default resources will be tagged with name and environment. map(string) {} no
tracing_config Configuration for lambda tracing.
object({
mode = optional(string, null)
capture_http_requests = optional(bool, false)
capture_error = optional(bool, false)
})
{} no
userdata_post_install Script to be run after the GitHub Actions runner is installed on the EC2 instances string "" no
userdata_pre_install Script to be run before the GitHub Actions runner is installed on the EC2 instances string "" no
userdata_template Alternative user-data template, replacing the default template. By providing your own user_data you have to take care of installing all required software, including the action runner. Variables userdata_pre/post_install are ignored. string null no
vpc_id The VPC for security groups of the action runners. string n/a yes
webhook_lambda_apigateway_access_log_settings Access log settings for webhook API gateway.
object({
destination_arn = string
format = string
})
null no
webhook_lambda_s3_key S3 key for webhook lambda function. Required if using S3 bucket to specify lambdas. string null no
webhook_lambda_s3_object_version S3 object version for webhook lambda function. Useful if S3 versioning is enabled on source bucket. string null no
webhook_lambda_timeout Time out of the webhook lambda in seconds. number 10 no
webhook_lambda_zip File location of the webhook lambda zip file. string null no
workflow_job_queue_configuration Configuration options for workflow job queue which is only applicable if the flag enable_workflow_job_events_queue is set to true.
object({
delay_seconds = number
visibility_timeout_seconds = number
message_retention_seconds = number
})
{
"delay_seconds": null,
"message_retention_seconds": null,
"visibility_timeout_seconds": null
}
no
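
As a rough illustration of how the inputs above fit together, the sketch below sets the four required inputs (aws_region, vpc_id, subnet_ids, github_app) plus a few optional ones. The module label `runners`, the registry source address, all `var.*` names, and the cron, time zone and count values are assumptions made for this example only. The lambda package inputs (for example webhook_lambda_zip or lambda_s3_bucket) are left out here and still need to be supplied for a real deployment.

```hcl
# Hypothetical input variables; names and types are chosen for this sketch only.
variable "aws_region" { type = string }
variable "vpc_id" { type = string }
variable "subnet_ids" { type = list(string) }

variable "github_app" {
  type = object({
    key_base64     = string
    id             = string
    webhook_secret = string
  })
  sensitive = true
}

module "runners" {
  # Assumed registry address; pin a version that matches your setup.
  source = "philips-labs/github-runner/aws"

  # Required inputs
  aws_region = var.aws_region
  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids
  github_app = var.github_app

  # Optional inputs, overriding the documented defaults
  prefix                = "gh-ci"
  runners_maximum_count = 10

  # Keep runners idle during office hours instead of scaling to zero.
  # The cron expression and time zone are illustrative values.
  idle_config = [{
    cron      = "* * 9-17 * * 1-5"
    timeZone  = "Europe/Amsterdam"
    idleCount = 2
  }]

  # Attach a dead letter queue to the build queue after three failed receives.
  redrive_build_queue = {
    enabled         = true
    maxReceiveCount = 3
  }
}
```

Optional attributes such as evictionStrategy in idle_config fall back to the defaults listed in the table when they are omitted.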

Outputs

Name Description
binaries_syncer n/a
queues SQS queues.
runners n/a
ssm_parameters n/a
webhook n/a
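
The module outputs can be re-exported from a root module, for example to look up the webhook details when configuring the GitHub App. The following is a minimal sketch that assumes the module was instantiated under the hypothetical label `runners`. It passes the output objects through unchanged, because their inner attributes are not described in the table above.

```hcl
# Re-export the webhook output; marked sensitive as a precaution,
# since its contents are not documented in this section.
output "runner_webhook" {
  value     = module.runners.webhook
  sensitive = true
}

# Re-export the SQS queues created by the module.
output "runner_queues" {
  value = module.runners.queues
}
```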

Contributing

We welcome contributions; please check out the contribution guide. Be aware that we use pre-commit hooks to update the docs.

Philips Forest

This module is part of the Philips Forest.

                                                     ___                   _
                                                    / __\__  _ __ ___  ___| |_
                                                   / _\/ _ \| '__/ _ \/ __| __|
                                                  / / | (_) | | |  __/\__ \ |_
                                                  \/   \___/|_|  \___||___/\__|

                                                                 Infrastructure

Talk to the forestkeepers in the runners-channel on Slack.
