
Step 1: Log in to IBM Cloud Pak for Data

1. Log in to IBM Cloud Pak for Data with valid credentials

To perform this lab, you need IBM Cloud Pak for Data credentials, which include a username and password. If you want to verify them from a terminal, see the sketch after the screenshot below.

Login
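
The following minimal sketch requests a bearer token from the Cloud Pak for Data authorization API. The hostname, username, and password are placeholders, and it assumes your cluster exposes the /icp4d-api/v1/authorize endpoint.

```python
import requests

# Assumption: replace with your own Cloud Pak for Data hostname and credentials.
CPD_URL = "https://cpd-host.example.com"
USERNAME = "your-username"
PASSWORD = "your-password"

# Request a bearer token from the CP4D authorization endpoint.
resp = requests.post(
    f"{CPD_URL}/icp4d-api/v1/authorize",
    json={"username": USERNAME, "password": PASSWORD},
    verify=False,  # many lab clusters use self-signed certificates
)
resp.raise_for_status()
print("Received bearer token of length:", len(resp.json()["token"]))
```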

2. Create a new project

Click the Navigation Menu, expand Projects, and click All Projects. Then click New Project + to create a new project by following the steps shown in the images below.

DV Menu DV Menu

Click Create an empty project to create an empty project.

DV Menu

Give the project a name (e.g., Data_Fabric_Project).

DV Menu

Once the project is created, you will be redirected to the project page, as shown below.

DV Menu

3. Create new connections to external data sources.

4. Select Add to project + and choose Connection as the asset type.

Create connection

5. Choose Amazon S3 as the connection type.

Amazon S3

6. Provide the Amazon S3 connection details to create a connection between Amazon S3 and IBM Cloud Pak for Data.

Amazon S3 Connection Amazon S3 Connection

7. Click Test connection to validate the connection. If the test succeeds, click Create to create the S3 connection. You can also sanity-check the same credentials outside the UI, as shown in the sketch after the screenshots below.

Amazon S3 Test Connection

Amazon S3 Test Connection
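
As an optional check outside the UI, the sketch below verifies the same S3 credentials with boto3. The credentials and bucket name are placeholders for the values used in this lab.

```python
import boto3

# Assumption: placeholder credentials and bucket name; use the same values
# you entered in the Cloud Pak for Data connection form.
s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
)

# List a few objects to confirm the credentials and bucket are valid.
response = s3.list_objects_v2(Bucket="your-lab-bucket", MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```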

8. Similarly, repeat the same steps to create connections for Amazon Redshift and Amazon RDS for PostgreSQL. A quick connectivity check for both databases is sketched after the screenshots below.

Create connection

Select the Amazon Redshift connector

Data Ingestion

Give the connection a name (e.g., Amazon Redshift Connection)

Data Ingestion

Again, go to your project page, click Add to project +, and choose Connection

Create connection

Select the Amazon RDS for PostgreSQL connector

Data Ingestion

Provide the connection details

Data Ingestion
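
Because both Amazon Redshift and Amazon RDS for PostgreSQL speak the PostgreSQL wire protocol, a quick connectivity check can be done with psycopg2. This is a minimal sketch; the hostnames, ports, databases, and credentials are placeholders for the values entered in the connection forms.

```python
import psycopg2

# Assumption: placeholder endpoints and credentials. Redshift usually listens
# on port 5439 and RDS for PostgreSQL on port 5432.
targets = {
    "Amazon Redshift": dict(host="your-cluster.redshift.amazonaws.com", port=5439,
                            dbname="dev", user="awsuser", password="your-password"),
    "Amazon RDS for PostgreSQL": dict(host="your-instance.rds.amazonaws.com", port=5432,
                                      dbname="postgres", user="postgres", password="your-password"),
}

for name, params in targets.items():
    # Open a connection and run a trivial query to prove connectivity.
    with psycopg2.connect(**params) as conn, conn.cursor() as cur:
        cur.execute("SELECT version();")
        print(name, "->", cur.fetchone()[0])
```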

9. Now that the data source connections are created, let's ingest data from a connected data source. Click Add to project + and then click Connected data.

Data Ingestion

Click Select source

Data Ingestion

Select apotheca_healthcare_personnel_data.csv

Data Ingestion

Give the asset a unique name

Data Ingestion

On the Data_Fabric_Project page, you will see the recently added asset. A quick way to preview the ingested CSV programmatically is sketched below.

Data Ingestion
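
To preview the ingested CSV programmatically, the sketch below downloads it with boto3 and loads it with pandas. The bucket name and credentials are placeholders; the object key matches the file selected above.

```python
import io

import boto3
import pandas as pd

# Assumption: placeholder bucket and credentials; the object key matches the
# file selected in the Connected data dialog above.
s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
)
obj = s3.get_object(Bucket="your-lab-bucket", Key="apotheca_healthcare_personnel_data.csv")

# Load the CSV into a DataFrame and preview the first rows.
df = pd.read_csv(io.BytesIO(obj["Body"].read()))
print(df.shape)
print(df.head())
```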

10. Similarly, collect data from the Redshift data source.

Data Ingestion

Discover and select actavis_pharma_healthcare_personnel_table

Data Ingestion

Specify a unique name

Data Ingestion

Verify the asset has been added to the project

Data Ingestion

11. Similarly, ingest data from the Amazon Aurora PostgreSQL database. A sketch that previews both relational source tables follows the screenshots below.

Data Ingestion

Discover and select mylan_specialty_personnel_data_table

Data Ingestion

Specify a unique name

Data Ingestion

Verify the asset has been added to the project

Data Ingestion
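
For a quick look at the two relational sources outside the UI, the sketch below reads a few rows from each table with pandas and psycopg2. The connection parameters are placeholders; the table names match the assets selected above.

```python
import pandas as pd
import psycopg2

# Assumption: placeholder connection parameters for the lab databases.
redshift = psycopg2.connect(host="your-cluster.redshift.amazonaws.com", port=5439,
                            dbname="dev", user="awsuser", password="your-password")
aurora = psycopg2.connect(host="your-instance.rds.amazonaws.com", port=5432,
                          dbname="postgres", user="postgres", password="your-password")

# Preview a few rows from each source table.
print(pd.read_sql("SELECT * FROM actavis_pharma_healthcare_personnel_table LIMIT 5", redshift))
print(pd.read_sql("SELECT * FROM mylan_specialty_personnel_data_table LIMIT 5", aurora))

redshift.close()
aurora.close()
```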

12. To create the integration pipeline, click Add to project + and then DataStage flow.

DataStage

13. Enter a DataStage flow name and click Create to create a new DataStage flow.

DataStage

14. On the DataStage home page you will see three groups of components: Connectors, to read data from or write data to data sources; Stages, to perform ETL operations; and Quality.

DataStage Homepage

DataStage Homepage

DataStage Homepage

15. Click Connectors to expand the list, then drag and drop Asset browser onto the DataStage canvas.

DataStage Asset Browser

16. You need to select data assets to build the integration pipeline. Select Data asset, select all three data assets ingested in the previous steps, and then click Add.

DataStage Asset Browser

17. You should see all three data assets on the canvas, as shown below.

DataStage Asset Browser

18. Now search for funnel in the search box and drag and drop the Funnel stage onto the canvas.

DataStage Asset Browser

19. Create links from all three data assets to the Funnel stage, as shown in the picture below.

DataStage Asset Browser

20. Double-click each of the three data assets one by one, click the Output tab, and then click Edit column to verify the data type, length, and nullability of all columns. The same kind of check can be done in pandas, as sketched after the screenshots below.

DataStage Asset Browser

DataStage Asset Browser

DataStage Asset Browser

DataStage Asset Browser
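
The same kind of metadata check can be approximated in pandas. The sketch below uses a small illustrative DataFrame; in practice you would run it against the three ingested datasets (for example, loaded with the earlier preview sketches).

```python
import pandas as pd

# Assumption: a small illustrative frame; in practice, load the ingested datasets.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Ann", "Bob", None],
})

# dtype, maximum string length, and null count mirror the Data type, Length,
# and Nullable columns shown in the Edit column view.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "max_length": df.astype(str).apply(lambda col: col.str.len().max()),
    "nulls": df.isnull().sum(),
})
print(summary)
```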

21. Search for the Remove Duplicates stage and drag and drop it onto the canvas.

DataStage Asset Browser

22. Double-click the Remove Duplicates stage to select the deduplication criteria.

DataStage Asset Browser

DataStage Asset Browser

23. Search for sort and drag and drop the Sort stage onto the canvas. Double-click the Sort stage and set the sort key to id. A pandas sketch of the funnel, deduplication, and sort logic follows the screenshots below.

DataStage Asset Browser

DataStage Asset Browser
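
For reference, the Funnel, Remove Duplicates, and Sort stages together perform roughly the following transformation, sketched here in pandas with small illustrative frames standing in for the three ingested datasets. This is not the DataStage flow itself, just an equivalent of its logic.

```python
import pandas as pd

# Assumption: small illustrative frames standing in for the three ingested datasets.
df_apotheca = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]})
df_actavis = pd.DataFrame({"id": [2, 3], "name": ["Bob", "Cara"]})
df_mylan = pd.DataFrame({"id": [4], "name": ["Dev"]})

# Funnel stage: union the three inputs into a single stream.
combined = pd.concat([df_apotheca, df_actavis, df_mylan], ignore_index=True)

# Remove Duplicates stage: drop rows that repeat the key column.
deduplicated = combined.drop_duplicates(subset=["id"])

# Sort stage: order the output by id.
result = deduplicated.sort_values("id").reset_index(drop=True)
print(result)
```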

24. Search for rds and then drag and drop the Amazon RDS for PostgreSQL connector onto the canvas.

DataStage Asset Browser

25. Double-click the RDS connector to specify the data source and target table name. An equivalent write to PostgreSQL is sketched below.

DataStage Asset Browser
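
Outside DataStage, an equivalent write to the RDS target could look like the sketch below, which loads a DataFrame into PostgreSQL with pandas and SQLAlchemy. The connection string is a placeholder, and the table name reuses the example integrated table name from later in this lab.

```python
import pandas as pd
from sqlalchemy import create_engine

# Assumption: placeholder RDS endpoint and credentials; the table name reuses
# the example integrated table name used in this lab.
engine = create_engine(
    "postgresql+psycopg2://postgres:your-password@your-instance.rds.amazonaws.com:5432/postgres"
)

# A small illustrative frame standing in for the integrated data stream.
result = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cara"]})

# Write the integrated rows to the target table, replacing it if it already exists.
result.to_sql("healthcare_personnel_integrated_data_table_v1", engine,
              if_exists="replace", index=False)
```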

26. Compile the DataStage pipeline and, if the compile succeeds, run it.

DataStage Asset Browser

DataStage Asset Browser

DataStage Asset Browser

27. The integrated data is now available in Amazon RDS for PostgreSQL. Let's ingest it from that data source.

Data Ingestion

Discover and select the integrated table (e.g., healthcare_personnel_integrated_data_table_v1)

DataStage Asset Browser

Specify a name for the asset

DataStage Asset Browser

Verify the data asset is present in the project. A quick row-count check against the integrated table is sketched below.

DataStage Asset Browser
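
As a final check, you can confirm that the integrated table is populated with a quick row count, as in the sketch below. The connection parameters are placeholders and the table name is the example used above.

```python
import psycopg2

# Assumption: placeholder RDS connection parameters; the table name matches
# the example integrated table above.
conn = psycopg2.connect(host="your-instance.rds.amazonaws.com", port=5432,
                        dbname="postgres", user="postgres", password="your-password")
with conn, conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM healthcare_personnel_integrated_data_table_v1;")
    print("Rows in integrated table:", cur.fetchone()[0])
conn.close()
```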