Building EDP starts with provisioning the infrastructure the platform runs on, and that is what this document covers: the process of deploying EDP's foundation.

Before proceeding with the steps described here, please make sure to read through and comply with the pre-requisites documentation.

Before deploying EDP's foundation, it is also important to understand how resources are deployed. Security and flexibility are the two aspects the design has been built around from the start.
Following best practices, we assign roles at the project level, granting the appropriate roles to groups (for humans) and to service accounts (for services and applications).
Service account creation follows the least-privilege principle: each service account performs a single task that requires access to a defined set of resources. The table below shows a high-level overview of the roles granted to each service account on each data layer, using READ or WRITE access patterns for simplicity. For detailed roles please refer to the code.

Service Account | Drop off | DWH Landing | DWH Curated | DWH Confidential |
---|---|---|---|---|
drop-sa | WRITE | - | - | - |
load-sa | READ | READ/WRITE | - | - |
transformation-sa | - | READ/WRITE | READ/WRITE | READ/WRITE |
orchestration-sa | - | - | - | - |

A full reference of IAM roles managed by the Education Data Platform is available here.
Using service account keys within a data pipeline exposes several security risks stemming from a credential leak. This blueprint shows how to leverage service account impersonation to avoid the need to create keys.
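
As a minimal sketch of this pattern, the Terraform provider can be configured to impersonate a service account instead of relying on a downloaded key file; `impersonate_service_account` is a standard option of the Google provider, while the service account email below is purely hypothetical.

```hcl
# Sketch only: impersonate a deployment service account instead of using a key file.
# The service account email is hypothetical; replace it with the account you deploy with.
provider "google" {
  impersonate_service_account = "edp-deployer@my-seed-project.iam.gserviceaccount.com"
}

provider "google-beta" {
  impersonate_service_account = "edp-deployer@my-seed-project.iam.gserviceaccount.com"
}
```
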
Resources follow the naming convention described below:

- `prefix-layer` for projects
- `prefix-layer-product` for resources
- `prefix-layer[2]-gcp-product[2]-counter` for services and service accounts

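For illustration only, the hypothetical names below show how the convention could play out for the prefix `myco`; the layer and product abbreviations are made up here and the actual values are defined in the code.

```hcl
# Illustrative only: hypothetical names following the convention above.
# The layer/product abbreviations shown are examples, not the values used by the code.
locals {
  example_project_id      = "myco-lnd"         # prefix-layer
  example_resource_name   = "myco-lnd-bq"      # prefix-layer-product
  example_service_account = "myco-ln-gcp-df-0" # prefix-layer[2]-gcp-product[2]-counter
}
```
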
The steps below describe how to deploy EDP's infrastructure; follow them carefully to complete the deployment.

Before proceeding, please double-check that all the prerequisites needed for the scripts to run are in place.
The very first step in deploying EDP is cloning the original repository (https://github.com/googlecloudplatform/education-data-platform) to your own organization. Everything else starts from there.
First, under the directory "1-foundations", create a new file called "terraform.tfvars". This file will be used to set Terraform's input variables.

Once the file is created, fill in the following variables:
```hcl
billing_account_id  = "111111-222222-333333"
folder_id           = "folders/123456789012"
organization_domain = "domain.com"
prefix              = "myco"
```
For further details, check the variables in `variables.tf` and update them according to the desired configuration.

Once the configuration is complete, run the project factory with:
```bash
terraform init
terraform apply
```
While this blueprint can be used as a standalone deployment, it can also be called directly as a Terraform module by providing the values of the variables as shown below:
module "data-platform" {
source = "./fabric/blueprints/data-solutions/data-platform-foundations"
billing_account_id = var.billing_account_id
folder_id = var.folder_id
organization_domain = "example.com"
prefix = "myprefix"
}
# tftest modules=42 resources=316
Data Catalog helps you document your data entries at scale. It relies on tags and tag templates to manage metadata for all data entries in a unified and centralized service. To implement column-level security on BigQuery, we suggest using Tags and Tag templates.
The default configuration will implement 3 tags:

- `3_Confidential`: policy tag for columns that include very sensitive information, such as credit card numbers.
- `2_Private`: policy tag for columns that include sensitive personally identifiable information (PII), such as a person's first name.
- `1_Sensitive`: policy tag for columns that include data that cannot be made public, such as the credit limit.

Anything that is not tagged is available to all users who have access to the data warehouse.
For the purpose of the blueprint, no group has access to tagged data. You can configure your tags and the roles associated with them through the `data_catalog_tags` variable. We suggest using the "Best practices for using policy tags in BigQuery" article as a guide to designing your tags' structure and access pattern.
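
As a non-authoritative sketch, a `data_catalog_tags` value following the `{tag => {ROLE => [MEMBERS]}}` format described in the variables table could look like the following; the group email is hypothetical.

```hcl
# Sketch only: tags with an optional IAM binding per tag ({tag => {ROLE => [MEMBERS]}}).
# A null value creates the tag without extra bindings; the group email is hypothetical.
data_catalog_tags = {
  "3_Confidential" = null
  "2_Private"      = null
  "1_Sensitive" = {
    "roles/datacatalog.categoryFineGrainedReader" = ["group:data-analysts@example.com"]
  }
}
```
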
We suggest a centralized approach to key management, where Organization Security is the only team that can access encryption material, and keyrings and keys are managed in a project external to the Education Data Platform.
To configure the use of Cloud KMS on resources, you have to specify the key ID in the `service_encryption_keys` variable. Key locations should match resource locations. Example:
```hcl
service_encryption_keys = {
  bq       = "KEY_URL_MULTIREGIONAL"
  composer = "KEY_URL_REGIONAL"
  dataflow = "KEY_URL_REGIONAL"
  storage  = "KEY_URL_MULTIREGIONAL"
  pubsub   = "KEY_URL_MULTIREGIONAL"
}
```
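
For reference, Cloud KMS key IDs follow the `projects/<project>/locations/<location>/keyRings/<keyring>/cryptoKeys/<key>` format; the filled-in sketch below uses hypothetical project, keyring, and key names.

```hcl
# Sketch only: hypothetical key IDs hosted in an external security project.
service_encryption_keys = {
  bq       = "projects/my-sec-project/locations/us/keyRings/edp-us/cryptoKeys/bq"
  composer = "projects/my-sec-project/locations/us-west1/keyRings/edp-us-west1/cryptoKeys/composer"
  dataflow = "projects/my-sec-project/locations/us-west1/keyRings/edp-us-west1/cryptoKeys/dataflow"
  storage  = "projects/my-sec-project/locations/us/keyRings/edp-us/cryptoKeys/storage"
  pubsub   = "projects/my-sec-project/locations/us/keyRings/edp-us/cryptoKeys/pubsub"
}
```
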
This step is optional and depends on customer policies and security best practices.
We suggest using Cloud Data Loss Prevention to identify/mask/tokenize your confidential data.
While implementing a Data Loss Prevention strategy is out of scope for this blueprint, we enable the service in two different projects so that Cloud Data Loss Prevention templates can be configured in one of two ways:
- during the ingestion phase, from Dataflow
- during the transformation phase, from BigQuery or Cloud Dataflow
Cloud Data Loss Prevention resources and templates should be stored in the security project.
You can find more details and best practices on using DLP for de-identification and re-identification of PII in large-scale datasets in the GCP documentation.
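
While the DLP strategy itself is out of scope here, the sketch below shows one way a de-identification template could be declared in the security project using the Terraform `google_data_loss_prevention_deidentify_template` resource; the parent project, names, and chosen transformation are illustrative assumptions, not part of the blueprint.

```hcl
# Sketch only: a minimal de-identification template kept in the security project.
# The parent project, names, and masking choice are hypothetical.
resource "google_data_loss_prevention_deidentify_template" "masking" {
  parent       = "projects/my-sec-project"
  display_name = "edp-masking"
  description  = "Masks credit card numbers before data reaches the curated layer."

  deidentify_config {
    info_type_transformations {
      transformations {
        info_types {
          name = "CREDIT_CARD_NUMBER"
        }
        primitive_transformation {
          character_mask_config {
            masking_character = "#"
            number_to_mask    = 12
          }
        }
      }
    }
  }
}
```
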
To create Cloud Key Management keys in the Education Data Platform, you can uncomment the Cloud Key Management resources configured in the `06-common.tf` file and update the Cloud Key Management key pointers in `local.service_encryption_keys.*` to point to the local resources created.
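
As a rough sketch of the kind of resources involved (project, names, and rotation period are hypothetical), a regional keyring and key declared with the standard `google_kms_key_ring` and `google_kms_crypto_key` resources look like this:

```hcl
# Sketch only: a regional keyring and key; project, names, and rotation are hypothetical.
resource "google_kms_key_ring" "edp" {
  project  = "my-sec-project"
  name     = "edp-us-west1"
  location = "us-west1"
}

resource "google_kms_crypto_key" "dataflow" {
  name            = "dataflow"
  key_ring        = google_kms_key_ring.edp.id
  rotation_period = "7776000s" # 90 days
}
```
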
To handle multiple groups of `data-analysts` accessing the same Data Warehouse layer projects, but only the datasets belonging to a specific group, you may want to assign roles at the BigQuery dataset level instead of at the project level. To do this, remove the IAM binding at the project level for the `data-analysts` group and grant roles at the BigQuery dataset level using the `iam` variable on the `bigquery-dataset` modules.
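
As a minimal, non-authoritative sketch, granting a per-team group read access on a single dataset through the module's `iam` variable could look like the following; the module path, project id, dataset id, and group email are hypothetical.

```hcl
# Sketch only: dataset-level IAM instead of project-level roles.
# Module path, project id, dataset id, and group email are hypothetical.
module "dwh-curated-team-a" {
  source     = "./fabric/modules/bigquery-dataset"
  project_id = "myco-dwh-cur"
  id         = "team_a_curated"
  iam = {
    "roles/bigquery.dataViewer" = ["group:team-a-analysts@example.com"]
  }
}
```
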
The application layer is out of the scope of this script. For demo purposes only, several Cloud Composer DAGs are provided. The demos import data from the `drop off` area to the `Data Warehouse Confidential` dataset using different features.

You can find examples in the [demo](./demo) folder.

name | description | type | required | default |
---|---|---|---|---|
billing_account_id | Billing account id. | string | ✓ | |
folder_id | Folder to be used for the networking resources in folders/nnnn format. | string | ✓ | |
organization_domain | Organization domain. | string | ✓ | |
prefix | Prefix used for resource names. | string | ✓ | |
composer_config | Cloud Composer config. | object({…}) | | {…} |
data_catalog_tags | List of Data Catalog Policy tags to be created with optional IAM binding configuration in {tag => {ROLE => [MEMBERS]}} format. | map(map(list(string))) | | {…} |
data_force_destroy | Flag to set 'force_destroy' on data services like BigQuery or Cloud Storage. | bool | | false |
groups | User groups. | map(string) | | {…} |
location | Location used for multi-regional resources. | string | | "us" |
network_config | Shared VPC network configurations to use. If null, networks will be created in projects with preconfigured values. | object({…}) | | null |
project_services | List of core services enabled on all projects. | list(string) | | […] |
project_suffix | Suffix used only for project ids. | string | | null |
region | Region used for regional resources. | string | | "us-west1" |
service_encryption_keys | Cloud KMS keys to use to encrypt different services. Key location should match service region. | object({…}) | | null |

name | description | sensitive |
---|---|---|
bigquery-datasets | BigQuery datasets. | |
demo_commands | Demo commands. | |
gcs-buckets | GCS buckets. | |
kms_keys | Cloud KMS keys. | |
projects | GCP projects information. | |
vpc_network | VPC network. | |
vpc_subnet | VPC subnetworks. | |