Building EDP starts with provisioning the infrastructure the platform runs on, and that is what this document covers: the process of deploying EDP's foundation.

Before proceeding with the steps described here, please make sure to read through and comply with the pre-requisites documentation.

Before deploying EDP's foundation, it is also important to understand how resources are deployed. Security and flexibility are the two aspects the design has been built around from the start.
Following best practices, we assign roles at the project level, granting the appropriate roles to groups (for humans) and to service accounts (for services and applications).
Service account creation follows the least-privilege principle: each service account performs a single task that requires access to a defined set of resources. The table below shows a high-level overview of the roles granted to each service account on each data layer, using READ or WRITE access patterns for simplicity. For detailed roles please refer to the code.

Service Account | Drop off | DWH Landing | DWH Curated | DWH Confidential |
---|---|---|---|---|
drop-sa | WRITE | - | - | - |
load-sa | READ | READ/WRITE | - | - |
transformation-sa | - | READ/WRITE | READ/WRITE | READ/WRITE |
orchestration-sa | - | - | - | - |

A full reference of IAM roles managed by the Education Data Platform is available here.
Using service account keys within a data pipeline exposes several security risks stemming from a credential leak. This blueprint shows how to leverage service account impersonation to avoid the need to create keys.
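
As a minimal sketch of this pattern, the Terraform provider can be configured to impersonate a service account instead of relying on a downloaded key file; `impersonate_service_account` is a standard option of the Google provider, while the service account email below is purely hypothetical.

```hcl
# Sketch only: impersonate a deployment service account instead of using a key file.
# The service account email is hypothetical; replace it with the account you deploy with.
provider "google" {
  impersonate_service_account = "edp-deployer@my-seed-project.iam.gserviceaccount.com"
}

provider "google-beta" {
  impersonate_service_account = "edp-deployer@my-seed-project.iam.gserviceaccount.com"
}
```
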
Resources follow the naming convention described below:

- `prefix-layer` for projects
- `prefix-layer-product` for resources
- `prefix-layer[2]-gcp-product[2]-counter` for services and service accounts

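For illustration only, the hypothetical names below show how the convention could play out for the prefix `myco`; the layer and product abbreviations are made up here and the actual values are defined in the code.

```hcl
# Illustrative only: hypothetical names following the convention above.
# The layer/product abbreviations shown are examples, not the values used by the code.
locals {
  example_project_id      = "myco-lnd"         # prefix-layer
  example_resource_name   = "myco-lnd-bq"      # prefix-layer-product
  example_service_account = "myco-ln-gcp-df-0" # prefix-layer[2]-gcp-product[2]-counter
}
```
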
The steps below describe how to deploy EDP's infrastructure; follow them carefully to complete the deployment.

Before proceeding, please double-check that all the prerequisites needed for the scripts to run are in place.
The very first step in deploying EDP is cloning the original repository (https://github.com/googlecloudplatform/education-data-platform) to your own organization. Everything else starts from there.
First, under the directory "1-foundations", create a new file called "terraform.tfvars". This file will be used to set Terraform's input variables.

Once the file is created, fill in the following variables:
```hcl
billing_account_id  = "111111-222222-333333"
folder_id           = "folders/123456789012"
organization_domain = "domain.com"
prefix              = "myco"
```
For further details, check the variables in `variables.tf` and update them according to the desired configuration.

Once the configuration is complete, run the project factory with:
```bash
terraform init
terraform apply
```
While this blueprint can be used as a standalone deployment, it can also be called directly as a Terraform module by providing the values of the variables as shown below:
module "data-platform" {
source = "./fabric/blueprints/data-solutions/data-platform-foundations"
billing_account_id = var.billing_account_id
folder_id = var.folder_id
organization_domain = "example.com"
prefix = "myprefix"
}
# tftest modules=42 resources=316
Data Catalog helps you document your data entries at scale. It relies on tags and tag templates to manage metadata for all data entries in a unified and centralized service. To implement column-level security on BigQuery, we suggest using Tags and Tag templates.
The default configuration will implement 3 tags:

- `3_Confidential`: policy tag for columns that include very sensitive information, such as credit card numbers.
- `2_Private`: policy tag for columns that include sensitive personally identifiable information (PII), such as a person's first name.
- `1_Sensitive`: policy tag for columns that include data that cannot be made public, such as the credit limit.

Anything that is not tagged is available to all users who have access to the data warehouse.
For the purpose of the blueprint, no group has access to tagged data. You can configure your tags and the roles associated with them through the `data_catalog_tags` variable. We suggest using the "Best practices for using policy tags in BigQuery" article as a guide to designing your tags' structure and access pattern.
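
As a non-authoritative sketch, a `data_catalog_tags` value following the `{tag => {ROLE => [MEMBERS]}}` format described in the variables table could look like the following; the group email is hypothetical.

```hcl
# Sketch only: tags with an optional IAM binding per tag ({tag => {ROLE => [MEMBERS]}}).
# A null value creates the tag without extra bindings; the group email is hypothetical.
data_catalog_tags = {
  "3_Confidential" = null
  "2_Private"      = null
  "1_Sensitive" = {
    "roles/datacatalog.categoryFineGrainedReader" = ["group:data-analysts@example.com"]
  }
}
```
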
We suggest a centralized approach to key management, where Organization Security is the only team that can access encryption material, and keyrings and keys are managed in a project external to the Education Data Platform.
To configure the use of Cloud KMS on resources, you have to specify the key ID in the `service_encryption_keys` variable. Key locations should match resource locations. Example:
```hcl
service_encryption_keys = {
  bq       = "KEY_URL_MULTIREGIONAL"
  composer = "KEY_URL_REGIONAL"
  dataflow = "KEY_URL_REGIONAL"
  storage  = "KEY_URL_MULTIREGIONAL"
  pubsub   = "KEY_URL_MULTIREGIONAL"
}
```
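
For reference, Cloud KMS key IDs follow the `projects/<project>/locations/<location>/keyRings/<keyring>/cryptoKeys/<key>` format; the filled-in sketch below uses hypothetical project, keyring, and key names.

```hcl
# Sketch only: hypothetical key IDs hosted in an external security project.
service_encryption_keys = {
  bq       = "projects/my-sec-project/locations/us/keyRings/edp-us/cryptoKeys/bq"
  composer = "projects/my-sec-project/locations/us-west1/keyRings/edp-us-west1/cryptoKeys/composer"
  dataflow = "projects/my-sec-project/locations/us-west1/keyRings/edp-us-west1/cryptoKeys/dataflow"
  storage  = "projects/my-sec-project/locations/us/keyRings/edp-us/cryptoKeys/storage"
  pubsub   = "projects/my-sec-project/locations/us/keyRings/edp-us/cryptoKeys/pubsub"
}
```
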
This step is optional and depends on customer policies and security best practices.
We suggest using Cloud Data Loss Prevention to identify/mask/tokenize your confidential data.
While implementing a Data Loss Prevention strategy is out of scope for this blueprint, we enable the service in two different projects so that Cloud Data Loss Prevention templates can be configured in one of two ways:
- during the ingestion phase, from Dataflow
- during the transformation phase, from BigQuery or Cloud Dataflow
Cloud Data Loss Prevention resources and templates should be stored in the security project.
You can find more details and best practices on using DLP for de-identification and re-identification of PII in large-scale datasets in the GCP documentation.
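
While the DLP strategy itself is out of scope here, the sketch below shows one way a de-identification template could be declared in the security project using the Terraform `google_data_loss_prevention_deidentify_template` resource; the parent project, names, and chosen transformation are illustrative assumptions, not part of the blueprint.

```hcl
# Sketch only: a minimal de-identification template kept in the security project.
# The parent project, names, and masking choice are hypothetical.
resource "google_data_loss_prevention_deidentify_template" "masking" {
  parent       = "projects/my-sec-project"
  display_name = "edp-masking"
  description  = "Masks credit card numbers before data reaches the curated layer."

  deidentify_config {
    info_type_transformations {
      transformations {
        info_types {
          name = "CREDIT_CARD_NUMBER"
        }
        primitive_transformation {
          character_mask_config {
            masking_character = "#"
            number_to_mask    = 12
          }
        }
      }
    }
  }
}
```
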
To create Cloud Key Management keys in the Education Data Platform, you can uncomment the Cloud Key Management resources configured in the `06-common.tf` file and update the Cloud Key Management key pointers in `local.service_encryption_keys.*` to point to the local resources created.
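
As a rough sketch of the kind of resources involved (project, names, and rotation period are hypothetical), a regional keyring and key declared with the standard `google_kms_key_ring` and `google_kms_crypto_key` resources look like this:

```hcl
# Sketch only: a regional keyring and key; project, names, and rotation are hypothetical.
resource "google_kms_key_ring" "edp" {
  project  = "my-sec-project"
  name     = "edp-us-west1"
  location = "us-west1"
}

resource "google_kms_crypto_key" "dataflow" {
  name            = "dataflow"
  key_ring        = google_kms_key_ring.edp.id
  rotation_period = "7776000s" # 90 days
}
```
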
To handle multiple groups of `data-analysts` accessing the same Data Warehouse layer projects, but only the datasets belonging to a specific group, you may want to assign roles at the BigQuery dataset level instead of at the project level. To do this, remove the IAM binding at the project level for the `data-analysts` group and grant roles at the BigQuery dataset level using the `iam` variable on the `bigquery-dataset` modules.
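
As a minimal, non-authoritative sketch, granting a per-team group read access on a single dataset through the module's `iam` variable could look like the following; the module path, project id, dataset id, and group email are hypothetical.

```hcl
# Sketch only: dataset-level IAM instead of project-level roles.
# Module path, project id, dataset id, and group email are hypothetical.
module "dwh-curated-team-a" {
  source     = "./fabric/modules/bigquery-dataset"
  project_id = "myco-dwh-cur"
  id         = "team_a_curated"
  iam = {
    "roles/bigquery.dataViewer" = ["group:team-a-analysts@example.com"]
  }
}
```
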
The application layer is out of the scope of this script. For demo purposes only, several Cloud Composer DAGs are provided. The demos import data from the `drop off` area to the `Data Warehouse Confidential` dataset using different features.

You can find examples in the [demo](./demo) folder.

name | description | type | required | default |
---|---|---|---|---|
billing_account_id | Billing account id. | string | ✓ | |
folder_id | Folder to be used for the networking resources in folders/nnnn format. | string | ✓ | |
organization_domain | Organization domain. | string | ✓ | |
prefix | Prefix used for resource names. | string | ✓ | |
composer_config | Cloud Composer config. | object({…}) | | {…} |
data_catalog_tags | List of Data Catalog Policy tags to be created with optional IAM binding configuration in {tag => {ROLE => [MEMBERS]}} format. | map(map(list(string))) | | {…} |
data_force_destroy | Flag to set 'force_destroy' on data services like BigQuery or Cloud Storage. | bool | | false |
groups | User groups. | map(string) | | {…} |
location | Location used for multi-regional resources. | string | | "us" |
network_config | Shared VPC network configurations to use. If null, networks will be created in projects with preconfigured values. | object({…}) | | null |
project_services | List of core services enabled on all projects. | list(string) | | […] |
project_suffix | Suffix used only for project ids. | string | | null |
region | Region used for regional resources. | string | | "us-west1" |
service_encryption_keys | Cloud KMS keys to use to encrypt different services. Key location should match service region. | object({…}) | | null |

name | description | sensitive |
---|---|---|
bigquery-datasets | BigQuery datasets. | |
demo_commands | Demo commands. | |
gcs-buckets | GCS buckets. | |
kms_keys | Cloud KMS keys. | |
projects | GCP projects information. | |
vpc_network | VPC network. | |
vpc_subnet | VPC subnetworks. | |