An example CDK app demonstrating how the CDK can be used to create Amazon DataZone resources. In addition to the publicly available DataZone resources (see the Documentation), this package also includes custom code to allow creation of:
- Project Memberships
- Glossaries
- Glossary Terms
- Metadata Forms
The CDK application currently supports:
- Creating a domain.
- Enabling Blueprints for the Data Lake and Data Warehouse.
- Creating projects for the domain and adding project members to them.
- Creating glossaries and glossary terms for the project. Adding TermRelations is currently not supported, because term relations can span glossary terms across multiple projects, so the ideal product experience for them still needs to be worked out.
- Creating Metadata Forms for the project.
- Creating Environment Profiles and Environments.
- Creating Data Sources for the environments.
You need to have the following dependencies in place:
- An AWS account (with AWS IAM Identity Center enabled)
- Bash/ZSH/WSL2
- AWS credentials and profiles for each environment under ~/.aws/config
- You must export AWS_PROFILE and AWS_REGION containing the AWS account credentials for the account where you will deploy Amazon DataZone. These must be set before performing any infrastructure deployment via AWS CDK below (see the example right after this list).
- Python version >= 3.12
- AWS SDK for Python >= 1.34.87
- Node >= v18.18.*
- NPM >= v10.2.*
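For example (the profile name below is just a placeholder; use one of the profiles configured in ~/.aws/config and the Region you plan to deploy to):
export AWS_PROFILE=my-deploy-profile ### placeholder profile name from ~/.aws/config
export AWS_REGION=us-east-1 ### placeholder Region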
Next, install the required npm dependencies for CDK:
npm ci ### installs the frozen dependencies from package-lock.json
Use the npm run cdk bootstrap -- --region ${AWS_REGION} --profile ${AWS_PROFILE} command to do the initial bootstrapping of the AWS account.
Please make sure the IAM role for the AWS CDK deployment starting with "cdk-cfn-xxxxxx-exec-role-" is a Data Lake administrator in AWS Lake Formation.
Now that you have installed all the required dependencies locally and completed the initial AWS account bootstrapping, pay attention to the following steps, as they prepare the deployment phase. Please keep them in order, as the order is important.
Run scripts/prepare.sh and provide your AWS_ACCOUNT_ID and AWS_REGION (the same values used for the AWS CDK commands above). The script will replace the AWS_ACCOUNT_ID_PLACEHOLDER and AWS_REGION_PLACEHOLDER values in the following config files:
- lib/config/project_config.json
- lib/config/project_environment_config.json
- lib/constants.ts
This step is mandatory for a single-account deployment.
Make sure to set CUSTOM_RESOURCE_LAMBDA_ROLE_ARN (under constants.ts) to the role that you will use for deploying the CDK. This is a necessary step for creating the custom resources. By default, CDK uses the role created during bootstrapping, which can be found in the CloudFormation stack named CDKToolkit with the logical ID CloudFormationExecutionRole. A rough illustration follows.
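As a sketch only (the actual entry lives in lib/constants.ts; the account ID and Region below are placeholders, and the role name follows the default CDK bootstrap naming described further down):
// Illustrative sketch only - replace the account ID and Region with your own values.
export const CUSTOM_RESOURCE_LAMBDA_ROLE_ARN =
  'arn:aws:iam::123456789012:role/cdk-hnb659fds-cfn-exec-role-123456789012-us-east-1';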
On the IAM console, update the trust policy of the IAM role for your AWS CDK deployment that starts with cdk-hnb659fds-cfn-exec-role- by adding the following statement. Replace ${ACCOUNT_ID} and ${REGION} with your specific AWS account ID and Region.
{
  "Effect": "Allow",
  "Principal": {
    "Service": "lambda.amazonaws.com"
  },
  "Action": "sts:AssumeRole",
  "Condition": {
    "ArnLike": {
      "aws:SourceArn": [
        "arn:aws:lambda:${REGION}:${ACCOUNT_ID}:function:DataZonePreqStack-GlossaryLambda*",
        "arn:aws:lambda:${REGION}:${ACCOUNT_ID}:function:DataZonePreqStack-GlossaryTermLambda*",
        "arn:aws:lambda:${REGION}:${ACCOUNT_ID}:function:DataZonePreqStack-FormLambda*"
      ]
    }
  }
}
Make sure to replace ${ACCOUNT_ID} and ${REGION} with your account- and Region-specific values. This is important for the Glossary/GlossaryTerm/Form Lambdas, which assume the cdk-hnb659fds-cfn-exec-role-${ACCOUNT_ID}-${REGION} role to perform those actions. (Here we use the default CDK qualifier, which adds hnb659fds to the name; the idea is that the execution role is defined in CUSTOM_RESOURCE_LAMBDA_ROLE_ARN and the trust relationship has to be updated on that particular role.) This enables the custom resources to use the CDK role and perform the desired operations. Without this change, the CDK execution role cannot be assumed by the custom resource Lambdas to perform the functionality mentioned above. You can verify the trust policy from the CLI as shown below.
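An optional way (not part of the original steps) to double-check the trust policy before and after the edit, using the same placeholders as above:
aws iam get-role \
  --role-name cdk-hnb659fds-cfn-exec-role-${ACCOUNT_ID}-${REGION} \
  --query 'Role.AssumeRolePolicyDocument'
### after the update, both cloudformation.amazonaws.com and lambda.amazonaws.com should appear as trusted principals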
The following custom configurations are currently supported:
- Domain
  - Update DOMAIN_NAME as needed (under constants.ts).
  - SSO configuration for the domain can be enabled by updating SHOULD_ENABLE_SSO_FOR_DOMAIN (under constants.ts).
  NOTE - If you want to use a different DOMAIN_EXECUTION_ROLE_NAME or KMS key name for setting up the domain, you can update those in constants.ts as well. A minimal sketch of these entries follows this item.
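A minimal sketch of what these constants could look like (all values are placeholders; constants.ts in the repo is the source of truth):
// Illustrative values only - adjust to your own naming.
export const DOMAIN_NAME = 'my-datazone-domain';                   // placeholder domain name
export const SHOULD_ENABLE_SSO_FOR_DOMAIN = false;                 // set to true to enable SSO for the domain
export const DOMAIN_EXECUTION_ROLE_NAME = 'MyDomainExecutionRole'; // placeholder role name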
- Environment Profiles, Environments and Data Sources. Please update project_environment_config.json as follows (a rough sketch of the file's shape follows this item):
  - You can create an Environment Profile for either the DataLake or DataWarehouse blueprint, configured via the EnvironmentBlueprintIdentifier. In addition, you can add a Description and the AWS account and Region to which the environment should be deployed.
  - You can add multiple environments for a single environment profile by updating the Environments field. Each environment can have a Name, a Description, and multiple data sources.
  - The data source name needs to be updated via DATA_SOURCE_NAME_PLACEHOLDER and its description via DATA_SOURCE_DESCRIPTION_PLACEHOLDER. This creates a data source for the environment. Refer to AWS::DataZone::DataSource for the various fields that can be configured for a data source.
  - Please update EXISTING_GLUE_DB_NAME_PLACEHOLDER with an existing AWS Glue database name in the same AWS account where you are deploying the Amazon DataZone core components with the AWS CDK. Make sure you have at least one existing Glue table to publish as a data source within Amazon DataZone. The schedule in the config is an example of scheduling the data source job run in Amazon DataZone; you can change it according to your requirements.
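Purely for orientation, an entry in project_environment_config.json might be organized roughly like this. The key names and nesting below are illustrative guesses assembled from the fields described above, not the file's authoritative schema; check the file generated by prepare.sh for the real structure.
{
  "EnvironmentProfileName": "example-profile",
  "EnvironmentBlueprintIdentifier": "DefaultDataLake",
  "Description": "Example data lake environment profile",
  "AwsAccountId": "AWS_ACCOUNT_ID_PLACEHOLDER",
  "Region": "AWS_REGION_PLACEHOLDER",
  "Environments": [
    {
      "Name": "example-environment",
      "Description": "Example environment",
      "DataSources": [
        {
          "Name": "DATA_SOURCE_NAME_PLACEHOLDER",
          "Description": "DATA_SOURCE_DESCRIPTION_PLACEHOLDER",
          "GlueDatabaseName": "EXISTING_GLUE_DB_NAME_PLACEHOLDER",
          "Schedule": "cron(0 8 * * ? *)"
        }
      ]
    }
  ]
}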
- Project and Project Owners
  - You can use project_config.json for creating new projects. Currently, the project name, description and owner can be configured.
  - You can also configure the administrator or owner of the project using projectOwnerIdentifier. This has been added because the CDK role would otherwise be the default owner of the project; this configuration allows you to set the correct entity as the project member.
  NOTE -
  - The projectOwnerIdentifier is the entity that you want to use as the owner of the project. This can be any IAM principal or SSO group that you would ideally want to add as an owner of the project.
  - If you want to use a group as the project owner, use GroupIdentifier as the projectOwnerIdentifierType; otherwise use UserIdentifier if you want to set it to an IAM principal or SSO user.
  - The Project Members custom resource can also be extended to add more members to the project.
  A sample sketch of project_config.json follows this item.
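As an illustration only: projectOwnerIdentifier and projectOwnerIdentifierType come from this section, while the other key names and the ARN are placeholders, so check the actual file for the real schema. An entry could look like:
{
  "projectName": "example-project",
  "projectDescription": "Example project created by this CDK app",
  "projectOwnerIdentifier": "arn:aws:iam::123456789012:role/ExampleProjectOwnerRole",
  "projectOwnerIdentifierType": "UserIdentifier"
}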
- Glossaries and Glossary Terms. An example business glossary and glossary terms are provided for the projects.
  - You can add glossaries and glossary terms to a project using project_glossary_config.json.
  - You need to specify the project name under which the glossary should be created.
  - Each glossary can have a Name, a Description and glossary terms.
  - A glossary term can have a Name, a ShortDescription and a LongDescription.
  NOTE - Adding TermRelations is currently not supported, because term relations can span glossary terms across multiple projects, so the ideal product experience for them still needs to be worked out. A sample sketch of project_glossary_config.json follows this item.
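A purely illustrative sketch; every key name below is a guess derived from the fields listed above, and the real schema is whatever project_glossary_config.json in the repo defines:
{
  "projectName": "example-project",
  "glossaryName": "ExampleGlossary",
  "glossaryDescription": "Business terms for the example project",
  "glossaryTerms": [
    {
      "name": "ExampleTerm",
      "shortDescription": "A short definition of the term",
      "longDescription": "A longer, more detailed definition of the term"
    }
  ]
}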
- Metadata Forms
  - You can add metadata forms to the project using project_form_config.json.
  - You need to specify the project name under which the form should be created.
  - A form can have a Name, a Description and a model.
  NOTE - Creating metadata forms requires stringified Smithy content for the form. You can use the following Smithy structure to create it.
@amazon.datazone#displayname(defaultName: "MetaDataDisplayName")
structure MetaDataTechnicalName {
    @documentation("FieldDescription")
    @required
    @amazon.datazone#displayname(defaultName: "FieldDisplayName")
    FieldTechnicalName: Integer
}
where:
- MetaDataTechnicalName is the name of the form being created; it should be the same as formName in project_form_config.json.
- MetaDataDisplayName is the display name for the form. Change MetaDataDisplayName to the name that you want.
- @documentation is the description for the field.
- @required denotes that the field is mandatory. Remove it to make the field optional.
- FieldDisplayName is the display name for the field. Change FieldDisplayName to the name that you want.
- FieldTechnicalName is the technical name of the field. It can be of any type supported by Smithy.
In addition to these, you can also use other Smithy annotations, such as default values, ranges, and so on, to configure the structure. A sample sketch of project_form_config.json, with the model already stringified, follows.
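As an illustration of how the stringified Smithy model fits into the config: formName and model are named in this section, while projectName and formDescription are placeholder key names, so check the actual file for the real schema. An entry could look like:
{
  "projectName": "example-project",
  "formName": "MetaDataTechnicalName",
  "formDescription": "Example metadata form",
  "model": "@amazon.datazone#displayname(defaultName: \"MetaDataDisplayName\")\nstructure MetaDataTechnicalName {\n    @documentation(\"FieldDescription\")\n    @required\n    @amazon.datazone#displayname(defaultName: \"FieldDisplayName\")\n    FieldTechnicalName: Integer\n}"
}
Note how the quotes inside the Smithy content are escaped and the line breaks become \n when the model is stringified.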
Now that you have all the configuration in place, from the local environment to the infrastructure-as-code level including Amazon DataZone, you can proceed with the initial deployment by running the following command (make sure you have exported the right AWS_REGION and AWS_PROFILE as environment variables):
npm run cdk deploy -- --all --region ${AWS_REGION} --profile ${AWS_PROFILE}
This will take a while; please keep an eye on the CLI output. During deployment, enter y when you see the prompt Do you wish to deploy these changes (y/n)? for the stacks whose changes you want to deploy.
Note: In case you have data already shared via Amazon DataZone, or any user associations, and you want to destroy the entire infrastructure during the cleanup phase below, you have to remove those manually first in the AWS console, as the AWS CDK won't be able to do that automatically for you.
To avoid incurring future charges, delete the resources. If you have already shared the data source using Amazon DataZone, you have to remove it manually first in the Amazon DataZone data portal because the AWS CDK isn't able to do that automatically.
- Unpublish the data within the Amazon DataZone data portal manually.
- Delete the data asset manually from the Amazon DataZone data portal.
- From the root of your repository folder, run the following command:
npm run cdk destroy -- --all --region ${AWS_REGION} --profile ${AWS_PROFILE}
- Delete the Amazon DataZone created databases in AWS Glue. If needed, refer to the tips to troubleshoot Lake Formation permission errors in AWS Glue.
- Remove the created IAM roles from Lake Formation administrative roles and tasks.
- If you get the message "Domain name already exists under this account, please use another one (Service: DataZone, Status Code: 409, Request ID: 2d054cb0-0fb7-466f-ae04-c53ff3c57c9a)" (RequestToken: 85ab4aa7-9e22-c7e6-8f00-80b5871e4bf7, HandlerErrorCode: AlreadyExists), change the domain name under lib/constants.ts and try to deploy again.
- If you get the message "Resource of type 'AWS::IAM::Role' with identifier 'CustomResourceProviderRole1' already exists." (RequestToken: 17a6384e-7b0f-03b3-1161-198fb044464d, HandlerErrorCode: AlreadyExists), this means you're accidentally trying to deploy everything in the same account but in a different Region. Make sure to use the Region you configured in your initial deployment. For the sake of simplicity, the DataZonePreReqStack is in one Region in the same account.
- If you get the "Unmanaged asset" warning on a data asset in your DataZone project, you must explicitly provide Amazon DataZone with Lake Formation permissions to access tables in this external AWS Glue database. For instructions, refer to Configure Lake Formation permissions for Amazon DataZone.