Terraform module that provisions AWS resources to run Metaflow in production.
This module consists of submodules that can be used separately as well:
- AWS Batch cluster to run Metaflow steps (metaflow-computation)
- blob storage and metadata database (metaflow-datastore)
- a service providing an API to record and query past executions (metaflow-metadata-service)
- resources to deploy Metaflow flows on Step Functions (metaflow-step-functions)
- Metaflow UI (metaflow-ui)

You can either use this high-level module, or submodules individually. See each module's corresponding README.md
for more details.
Here's a minimal end-to-end example of using this module together with a VPC:

    # Random suffix for this deployment
    resource "random_string" "suffix" {
      length  = 8
      special = false
      upper   = false
    }

    locals {
      resource_prefix = "metaflow"
      resource_suffix = random_string.suffix.result
    }

    data "aws_availability_zones" "available" {
    }

    # Whether the Metadata Service gets a public IP. Defaults to true here
    # because this example places it in public subnets.
    variable "with_public_ip" {
      type    = bool
      default = true
    }

    # VPC infra using https://github.com/terraform-aws-modules/terraform-aws-vpc
    module "vpc" {
      source  = "terraform-aws-modules/vpc/aws"
      version = "3.13.0"

      name = "${local.resource_prefix}-${local.resource_suffix}"
      cidr = "10.10.0.0/16"

      azs             = data.aws_availability_zones.available.names
      private_subnets = ["10.10.8.0/21", "10.10.16.0/21", "10.10.24.0/21"]
      public_subnets  = ["10.10.128.0/21", "10.10.136.0/21", "10.10.144.0/21"]

      enable_nat_gateway   = true
      single_nat_gateway   = true
      enable_dns_hostnames = true
    }

    module "metaflow" {
      source  = "outerbounds/metaflow/aws"
      version = "0.3.0"

      resource_prefix = local.resource_prefix
      resource_suffix = local.resource_suffix

      enable_step_functions = false
      subnet1_id            = module.vpc.public_subnets[0]
      subnet2_id            = module.vpc.public_subnets[1]
      # The VPC module exposes its primary CIDR as a single string
      # (vpc_cidr_block), so wrap it in a list
      vpc_cidr_blocks = [module.vpc.vpc_cidr_block]
      vpc_id          = module.vpc.vpc_id
      with_public_ip  = var.with_public_ip

      tags = {
        "managedBy" = "terraform"
      }
    }

    # The module will generate a Metaflow config in JSON format, write it to a file
    resource "local_file" "metaflow_config" {
      content  = module.metaflow.metaflow_profile_json
      filename = "./metaflow_profile.json"
    }

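The example above uses resources from the aws, random, and local providers, so the root configuration needs to declare them. A minimal sketch of that setup follows; the region is a placeholder assumption and no provider version constraints are implied by this module's documentation:

```hcl
terraform {
  required_providers {
    # Used by the VPC and Metaflow modules and the availability zones data source
    aws = {
      source = "hashicorp/aws"
    }
    # Used by random_string.suffix
    random = {
      source = "hashicorp/random"
    }
    # Used by local_file.metaflow_config
    local = {
      source = "hashicorp/local"
    }
  }
}

provider "aws" {
  region = "us-west-2" # assumption: replace with the region you deploy to
}
```

With this in place, `terraform init` downloads the providers and both modules before `terraform apply` is run.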
You can find a more complete example that uses this module and also sets up SageMaker notebooks and other non-Metaflow-specific parts of the infrastructure in this repo.
Name | Source | Version |
---|---|---|
metaflow-common | ./modules/common | n/a |
metaflow-computation | ./modules/computation | n/a |
metaflow-datastore | ./modules/datastore | n/a |
metaflow-metadata-service | ./modules/metadata-service | n/a |
metaflow-step-functions | ./modules/step-functions | n/a |
metaflow-ui | ./modules/ui | n/a |

Name | Description | Type | Default | Required |
---|---|---|---|---|
access_list_cidr_blocks | List of CIDRs we want to grant access to our Metaflow Metadata Service. Usually this is our VPN's CIDR blocks. | list(string) | [] | no |
api_basic_auth | Enable basic auth for API Gateway? (requires key export) | bool | true | no |
batch_type | AWS Batch Compute Type ('ec2', 'fargate') | string | "ec2" | no |
compute_environment_desired_vcpus | Desired Starting VCPUs for Batch Compute Environment [0-16] for EC2 Batch Compute Environment (ignored for Fargate) | number | 8 | no |
compute_environment_egress_cidr_blocks | CIDR blocks to which egress is allowed from the Batch Compute environment's security group | list(string) | [ | no |
compute_environment_instance_types | The instance types for the compute environment | list(string) | [ | no |
compute_environment_max_vcpus | Maximum VCPUs for Batch Compute Environment [16-96] | number | 64 | no |
compute_environment_min_vcpus | Minimum VCPUs for Batch Compute Environment [0-16] for EC2 Batch Compute Environment (ignored for Fargate) | number | 8 | no |
enable_custom_batch_container_registry | Provisions infrastructure for custom Amazon ECR container registry if enabled | bool | false | no |
enable_step_functions | Provisions infrastructure for step functions if enabled | bool | n/a | yes |
extra_ui_backend_env_vars | Additional environment variables for UI backend container | map(string) | {} | no |
extra_ui_static_env_vars | Additional environment variables for UI static app | map(string) | {} | no |
iam_partition | IAM Partition (Select aws-us-gov for AWS GovCloud, otherwise leave as is) | string | "aws" | no |
launch_template_http_endpoint | Whether the metadata service is available. Can be 'enabled' or 'disabled' | string | "enabled" | no |
launch_template_http_put_response_hop_limit | The desired HTTP PUT response hop limit for instance metadata requests. Can be an integer from 1 to 64 | number | 2 | no |
launch_template_http_tokens | Whether or not the metadata service requires session tokens, also referred to as Instance Metadata Service Version 2 (IMDSv2). Can be 'optional' or 'required' | string | "optional" | no |
metadata_service_container_image | Container image for metadata service | string | "" | no |
resource_prefix | string prefix for all resources | string | "metaflow" | no |
resource_suffix | string suffix for all resources | string | "" | no |
subnet1_id | First subnet used for availability zone redundancy | string | n/a | yes |
subnet2_id | Second subnet used for availability zone redundancy | string | n/a | yes |
tags | aws tags | map(string) | n/a | yes |
ui_alb_internal | Defines whether the ALB for the UI is internal | bool | false | no |
ui_allow_list | List of CIDRs we want to grant access to our Metaflow UI Service. Usually this is our VPN's CIDR blocks. | list(string) | [] | no |
ui_certificate_arn | SSL certificate for UI. If set to empty string, UI is disabled. | string | "" | no |
ui_static_container_image | Container image for the UI frontend app | string | "" | no |
vpc_cidr_blocks | The VPC CIDR blocks that we'll access list on our Metadata Service API to allow all internal communications | list(string) | n/a | yes |
vpc_id | The id of the single VPC we stood up for all Metaflow resources to exist in. | string | n/a | yes |
with_public_ip | Enable public IP assignment for the Metadata Service. Typically you want this to be set to true if using public subnets as subnet1_id and subnet2_id, and false otherwise | bool | false | no |

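Since most inputs have defaults, a module call usually sets the required inputs plus a handful of overrides. The sketch below shows one possible combination for a deployment into existing private subnets. The variable names (`vpc_id`, `vpc_cidr_blocks`, `private_subnet_ids`), the `203.0.113.0/24` CIDR, and the chosen vCPU values are illustrative assumptions, not recommendations from this module:

```hcl
# Hypothetical variables describing an existing network
variable "vpc_id" { type = string }
variable "vpc_cidr_blocks" { type = list(string) }
variable "private_subnet_ids" { type = list(string) } # expects at least two subnets

module "metaflow" {
  source  = "outerbounds/metaflow/aws"
  version = "0.3.0"

  # Required inputs (marked "yes" in the table above)
  enable_step_functions = true
  subnet1_id            = var.private_subnet_ids[0]
  subnet2_id            = var.private_subnet_ids[1]
  vpc_cidr_blocks       = var.vpc_cidr_blocks
  vpc_id                = var.vpc_id
  tags                  = { "managedBy" = "terraform" }

  # Private subnets, so no public IP for the Metadata Service
  with_public_ip = false

  # Let the EC2 compute environment scale down to zero when idle
  batch_type                        = "ec2"
  compute_environment_min_vcpus     = 0
  compute_environment_desired_vcpus = 0
  compute_environment_max_vcpus     = 32

  # Require IMDSv2 session tokens on Batch instances
  launch_template_http_tokens = "required"

  # Assumed VPN range allowed to reach the Metadata Service
  access_list_cidr_blocks = ["203.0.113.0/24"]
}
```

Any input left out keeps the default listed in the table.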
Name | Description |
---|---|
METAFLOW_BATCH_JOB_QUEUE | AWS Batch Job Queue ARN for Metaflow |
METAFLOW_DATASTORE_SYSROOT_S3 | Amazon S3 URL for Metaflow DataStore |
METAFLOW_DATATOOLS_S3ROOT | Amazon S3 URL for Metaflow DataTools |
METAFLOW_ECS_S3_ACCESS_IAM_ROLE | Role for AWS Batch to Access Amazon S3 |
METAFLOW_EVENTS_SFN_ACCESS_IAM_ROLE | IAM role for Amazon EventBridge to access AWS Step Functions. |
METAFLOW_SERVICE_INTERNAL_URL | URL for Metadata Service (Accessible in VPC) |
METAFLOW_SERVICE_URL | URL for Metadata Service (Open to Public Access) |
METAFLOW_SFN_DYNAMO_DB_TABLE | AWS DynamoDB table name for tracking AWS Step Functions execution metadata. |
METAFLOW_SFN_IAM_ROLE | IAM role for AWS Step Functions to access AWS resources (AWS Batch, AWS DynamoDB). |
api_gateway_rest_api_id_key_id | API Gateway Key ID for Metadata Service. Fetch Key from AWS Console [METAFLOW_SERVICE_AUTH_KEY] |
batch_compute_environment_security_group_id | The ID of the security group attached to the Batch Compute environment. |
datastore_s3_bucket_kms_key_arn | The ARN of the KMS key used to encrypt the Metaflow datastore S3 bucket |
metadata_svc_ecs_task_role_arn | n/a |
metaflow_api_gateway_rest_api_id | The ID of the API Gateway REST API we'll use to accept MetaData service requests to forward to the Fargate API instance |
metaflow_batch_container_image | The ECR repo containing the metaflow batch image |
metaflow_profile_json | Metaflow profile JSON object that can be used to communicate with this Metaflow Stack. Store this in ~/.metaflowconfig/config_[stack-name].json and select with $ export METAFLOW_PROFILE=[stack-name]. |
metaflow_s3_bucket_arn | The ARN of the bucket we'll be using as blob storage |
metaflow_s3_bucket_name | The name of the bucket we'll be using as blob storage |
migration_function_arn | ARN of DB Migration Function |
ui_alb_arn | UI ALB ARN |
ui_alb_dns_name | UI ALB DNS name |
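
Module outputs are not shown by Terraform unless the root configuration re-exports them. A minimal sketch of doing so is below; it assumes a `module "metaflow"` block like the one in the example at the top of this README, and the root-level output names are hypothetical:

```hcl
# Re-export selected module outputs from the root configuration.
# Assumes the module is instantiated as module "metaflow".
output "metaflow_service_url" {
  description = "Metadata Service URL reported by the module"
  value       = module.metaflow.METAFLOW_SERVICE_URL
}

output "metaflow_datastore_s3_root" {
  description = "Amazon S3 URL for Metaflow DataStore"
  value       = module.metaflow.METAFLOW_DATASTORE_SYSROOT_S3
}

output "metaflow_api_key_id" {
  description = "API Gateway Key ID for Metadata Service"
  value       = module.metaflow.api_gateway_rest_api_id_key_id
}
```

After `terraform apply`, a single value can be read with `terraform output metaflow_service_url`.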