
Education Data Platform (EDP) foundations

Building EDP starts with constructing the infrastructure on which the platform will sit, and that is what this document is about: it details the process of deploying EDP's foundation.

Before proceeding with the steps described here, please make sure to read through and comply with what is described in the prerequisites documentation.

Before deploying EDP's foundation, though, it is important to understand some key aspects of the way resources are deployed. Security and flexibility have shaped the design from the start.

Roles

We assign roles on resources at the project level, granting the appropriate roles via groups (humans) and service accounts (services and applications) according to best practices.
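
As a minimal illustration of this pattern (the project id and group below are hypothetical), a project-level role grant to a group looks like this in Terraform:

resource "google_project_iam_member" "data_engineers" {
  project = "myco-lnd"                            # hypothetical project id
  role    = "roles/bigquery.dataEditor"
  member  = "group:gcp-data-engineers@domain.com" # hypothetical group
}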

Service accounts

Service accounts follow the least privilege principle: each one performs a single task that requires access to a defined set of resources. The table below shows a high-level overview of the roles for each service account on each data layer, using READ or WRITE access patterns for simplicity. For detailed roles, please refer to the code.

| Service Account | Drop off | DWH Landing | DWH Curated | DWH Confidential |
|---|---|---|---|---|
| drop-sa | WRITE | - | - | - |
| load-sa | READ | READ/WRITE | - | - |
| transformation-sa | - | READ/WRITE | READ/WRITE | READ/WRITE |
| orchestration-sa | - | - | - | - |

A full reference of IAM roles managed by the Education Data Platform is available here.

Using service account keys within a data pipeline exposes several security risks stemming from a credential leak. This blueprint shows how to leverage service account impersonation to avoid the need to create keys.
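
A minimal sketch of what impersonation looks like in practice (the service account email is hypothetical): instead of downloading a key, Terraform can impersonate a privileged service account through the google provider:

provider "google" {
  # Short-lived credentials are minted for this account; no key file is ever created.
  impersonate_service_account = "myco-terraform@myco-prj.iam.gserviceaccount.com" # hypothetical
}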

Resource naming conventions

Resources follow the naming convention described below.

  • prefix-layer for projects
  • prefix-layer-product for resources
  • prefix-layer[2]-gcp-product[2]-counter for services and service accounts
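
As a purely illustrative reading of the convention (all codes below are hypothetical), a prefix myco combined with a landing layer and Cloud Storage as the product could yield:

  • myco-landing for the project
  • myco-landing-storage for the bucket
  • myco-la-gcp-cs-0 for the service account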

How to run this script

The steps below highlight what it takes to deploy EDP's infrastructure. Follow them carefully to get the deployment done.

Before proceeding with those steps, please double-check that all the prerequisites needed for the scripts to run are in place.

1. Clone the original repository

The very first step to getting EDP deployed is cloning the original repository (https://github.com/GoogleCloudPlatform/education-data-platform) into your own organization. Everything starts from there.

2. Variable configuration

First, under the directory "1-foundations", create a new file called "terraform.tfvars". This file will be used to set up Terraform's environment variables.

Once the file is created, there are a few variables you will need to fill in:

billing_account_id  = "111111-222222-333333"
folder_id           = "folders/123456789012"
organization_domain = "domain.com"
prefix              = "myco"

For finer details, check the variables in variables.tf and update them according to the desired configuration.

Once the configuration is complete, run the project factory:

terraform init
terraform apply

3. How to use this blueprint from Terraform

While this blueprint can be used as a standalone deployment, it can also be called directly as a Terraform module by providing the values of the variables as shown below:

module "data-platform" {
  source              = "./fabric/blueprints/data-solutions/data-platform-foundations"
  billing_account_id  = var.billing_account_id
  folder_id           = var.folder_id
  organization_domain = "example.com"
  prefix              = "myprefix"
}

# tftest modules=42 resources=316
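
When the blueprint is called as a module, its outputs (listed at the end of this document) can be consumed like those of any other module. For example, assuming the module block above, the generated projects can be surfaced as follows:

output "edp_projects" {
  description = "Projects created by the Education Data Platform."
  value       = module.data-platform.projects
}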

4. Data Catalog

Data Catalog helps you document your data entries at scale. It relies on tags and tag templates to manage metadata for all data entries in a unified and centralized service. To implement column-level security on BigQuery, we suggest using tags and tag templates.

The default configuration will implement 3 tags:

  • 3_Confidential: policy tag for columns that include very sensitive information, such as credit card numbers.
  • 2_Private: policy tag for columns that include sensitive personal identifiable information (PII) information, such as a person's first name.
  • 1_Sensitive: policy tag for columns that include data that cannot be made public, such as the credit limit.

Anything that is not tagged is available to all users who have access to the data warehouse.

For the purpose of the blueprint, no group has access to tagged data. You can configure your tags and the roles associated with them through the data_catalog_tags variable, as sketched below. We suggest using the "Best practices for using policy tags in BigQuery" article as a guide when designing your tag structure and access patterns.
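
As a sketch of the expected {tag => {ROLE => [MEMBERS]}} format (the group email is hypothetical), granting one group fine-grained read access on a single tag could look like:

data_catalog_tags = {
  "3_Confidential" = null
  "2_Private"      = null
  "1_Sensitive" = {
    "roles/datacatalog.categoryFineGrainedReader" = ["group:gcp-data-analysts@domain.com"]
  }
}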

Optional configuration

Encryption (optional)

We suggest a centralized approach to key management, where Organization Security is the only team that can access encryption material, and keyrings and keys are managed in a project external to the Education Data Platform.

Centralized Cloud Key Management high-level diagram

To configure the use of Cloud KMS on resources, specify the key ids in the service_encryption_keys variable. Key locations should match resource locations. Example:

service_encryption_keys = {
    bq       = "KEY_URL_MULTIREGIONAL"
    composer = "KEY_URL_REGIONAL"
    dataflow = "KEY_URL_REGIONAL"
    storage  = "KEY_URL_MULTIREGIONAL"
    pubsub   = "KEY_URL_MULTIREGIONAL"
}
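
Each value is a Cloud KMS key resource name; for example (project, key ring, and key names are hypothetical):

projects/my-security-prj/locations/us/keyRings/edp-keyring/cryptoKeys/bq-key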

This step is optional and depends on customer policies and security best practices.

Data Anonymization (optional)

We suggest using Cloud Data Loss Prevention to identify/mask/tokenize your confidential data.

While implementing a Data Loss Prevention strategy is out of scope for this blueprint, we enable the service in two different projects so that Cloud Data Loss Prevention templates can be configured where they best fit your operating model.

In the centralized approach pictured below, Cloud Data Loss Prevention resources and templates are stored in the security project:

Centralized Cloud Data Loss Prevention high-level diagram

You can find more details and best practices on using Cloud Data Loss Prevention for de-identification and re-identification of PII in large-scale datasets in the Google Cloud documentation.

Customizations (optional)

Create Cloud Key Management keys as part of the Education Data Platform

To create Cloud Key Management keys in the Education Data Platform, you can uncomment the Cloud Key Management resources configured in the 06-common.tf file and update the Cloud Key Management key pointers in local.service_encryption_keys.* to reference the local resources created.

Assign roles at BQ Dataset level

To handle multiple groups of data analysts accessing the same Data Warehouse layer projects but only the datasets belonging to a specific group, you may want to assign roles at the BigQuery dataset level instead of at the project level. To do this, remove the IAM binding at the project level for the data-analysts group and grant roles at the BigQuery dataset level using the iam variable on the bigquery-dataset modules, as shown in the sketch below.
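
A minimal sketch of such a grant, assuming the Cloud Foundation Fabric bigquery-dataset module and hypothetical project, dataset, and group names:

module "team-a-curated-dataset" {
  source     = "./fabric/modules/bigquery-dataset"
  project_id = "myco-cur"    # hypothetical curated-layer project id
  id         = "team_a_data" # hypothetical dataset id
  iam = {
    "roles/bigquery.dataViewer" = ["group:team-a-analysts@domain.com"] # hypothetical group
  }
}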

Demo pipeline

The application layer is out of scope for this blueprint. For demo purposes only, several Cloud Composer DAGs are provided. The demos import data from the drop off area to the Data Warehouse Confidential dataset using different features.

You can find examples in the [demo](./demo) folder.

Variables

| name | description | type | required | default |
|---|---|---|---|---|
| billing_account_id | Billing account id. | string | ✓ | |
| folder_id | Folder to be used for the networking resources in folders/nnnn format. | string | ✓ | |
| organization_domain | Organization domain. | string | ✓ | |
| prefix | Prefix used for resource names. | string | ✓ | |
| composer_config | Cloud Composer config. | object({…}) | | {…} |
| data_catalog_tags | List of Data Catalog Policy tags to be created with optional IAM binding configuration in {tag => {ROLE => [MEMBERS]}} format. | map(map(list(string))) | | {…} |
| data_force_destroy | Flag to set 'force_destroy' on data services like BigQuery or Cloud Storage. | bool | | false |
| groups | User groups. | map(string) | | {…} |
| location | Location used for multi-regional resources. | string | | "us" |
| network_config | Shared VPC network configurations to use. If null, networks will be created in projects with preconfigured values. | object({…}) | | null |
| project_services | List of core services enabled on all projects. | list(string) | | […] |
| project_suffix | Suffix used only for project ids. | string | | null |
| region | Region used for regional resources. | string | | "us-west1" |
| service_encryption_keys | Cloud KMS keys to use to encrypt different services. Key location should match service region. | object({…}) | | null |

Outputs

| name | description | sensitive |
|---|---|---|
| bigquery-datasets | BigQuery datasets. | |
| demo_commands | Demo commands. | |
| gcs-buckets | GCS buckets. | |
| kms_keys | Cloud KMS keys. | |
| projects | GCP projects information. | |
| vpc_network | VPC network. | |
| vpc_subnet | VPC subnetworks. | |