- Log in to IBM Cloud Pak for Data with valid credentials
To perform this lab, you need IBM Cloud Pak for Data credentials, which include both a username and a password.
Click the Navigation Menu, expand Projects, and click All Projects. Then click ‘New Project +’ to create a new project by following the steps shown in the images below.
Click Create an empty project
Give the project a name (e.g. Data_Fabric_Project)
Once the project is created, you will be redirected to the project page, as shown below
- Select Add to project + and choose Connection as the asset type
- Choose Amazon S3 as the connection type.
- Provide the Amazon S3 connection details to establish a connection between Amazon S3 and IBM Cloud Pak for Data.
- Click Test connection to validate the connection. If it is successful, click Create to create the S3 connection.
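If the test fails, it can help to verify the same credentials and bucket outside Cloud Pak for Data. Below is a minimal sketch using boto3; the access key, secret key, region, and bucket name are placeholders for your own values.

```python
# Sanity-check the S3 credentials and bucket outside Cloud Pak for Data.
# The access key, secret key, region, and bucket name below are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
    region_name="us-east-1",
)

# List a few objects to confirm the credentials can read the bucket.
response = s3.list_objects_v2(Bucket="your-healthcare-data-bucket", MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"])
```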
- Similarly, repeat the same steps to create connections for the asset types Amazon Redshift and Amazon RDS for PostgreSQL (a quick connectivity check for both is sketched after these steps).
Select Amazon Redshift connector
Give the connection a name (e.g. Amazon Redshift Connection)
Again, go to your project page, click Add to project +, and choose Connection
Select Amazon RDS for PostgreSQL connector
Provide connection details
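Before testing the connections in Cloud Pak for Data, you can optionally verify that the Redshift cluster and the RDS for PostgreSQL instance are reachable with the same credentials. The sketch below uses psycopg2 (Redshift speaks the PostgreSQL wire protocol); all hostnames, ports, database names, and credentials are placeholders.

```python
# Quick connectivity check for the Redshift and RDS for PostgreSQL sources.
# Hostnames, ports, database names, and credentials are placeholders.
import psycopg2

targets = {
    "redshift": {
        "host": "your-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
        "port": 5439,
        "dbname": "dev",
    },
    "rds_postgresql": {
        "host": "your-instance.xxxxxxxx.us-east-1.rds.amazonaws.com",
        "port": 5432,
        "dbname": "postgres",
    },
}

for name, cfg in targets.items():
    # Redshift uses the PostgreSQL wire protocol, so psycopg2 works for both.
    conn = psycopg2.connect(user="your_user", password="your_password", **cfg)
    with conn.cursor() as cur:
        cur.execute("SELECT 1;")
        print(name, "connection OK:", cur.fetchone())
    conn.close()
```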
- Now that the data source connections are created, let's ingest data from the connected data sources. Click Add to project + and then click Connected data
Click Select source
Select apotheca_healthcare_personnel_data.csv
Give the asset a unique name
On the Data_Fabric_Project page, you will see the recently added asset
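Optionally, you can preview the CSV directly from S3 to confirm it is the file you expect before building the pipeline. This sketch assumes your AWS credentials are already configured in the environment; the bucket name and object key are placeholders.

```python
# Optional: preview the CSV directly from S3 before building the pipeline.
# The bucket name and object key are placeholders.
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")  # assumes AWS credentials are configured in the environment
obj = s3.get_object(
    Bucket="your-healthcare-data-bucket",
    Key="apotheca_healthcare_personnel_data.csv",
)
df = pd.read_csv(io.BytesIO(obj["Body"].read()))
print(df.head())
print(df.dtypes)
```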
- Similarly, collect data from the Redshift data source
Discover and select actavis_pharma_healthcare_personnel_table
Specify a unique name
Verify asset has been added to the project
- Similarly, ingest data from the Amazon Aurora PostgreSQL database.
Discover and select mylan_specialty_personnel_data_table
Specify a unique name
Verify asset has been added to the project
- To create the integration pipeline, click Add to project + and then DataStage flow.
- Enter a DataStage flow name and click Create to create a new DataStage flow.
- On the DataStage home page you will see three options: Connectors, to read from or write to data sources; Stages, to perform ETL operations; and Quality.
- Click Connectors to expand it, then drag and drop the Asset Browser onto the DataStage canvas.
- You need to select data assets to create the integration pipeline. Select Data asset, select the three data assets that were ingested in the previous steps, and then click Add.
- You should be able to see all three data assets as shown below.
- Now search for 'funnel' in the search box and drag and drop the Funnel stage onto the canvas.
- Create links from all the data assets to the Funnel stage, as shown in the picture below.
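The Funnel stage combines the rows arriving on its input links into a single output link, essentially a row-wise union. The pandas sketch below illustrates the idea with small placeholder DataFrames standing in for the three data assets; it is not part of the DataStage flow itself.

```python
# Conceptual equivalent of the Funnel stage: union rows from the three sources.
# The placeholder DataFrames stand in for the three data assets on the canvas.
import pandas as pd

apotheca = pd.DataFrame({"id": [1, 2], "name": ["A", "B"]})
actavis = pd.DataFrame({"id": [3], "name": ["C"]})
mylan = pd.DataFrame({"id": [4], "name": ["D"]})

# Funnel appends the rows from every input link onto one output link.
combined = pd.concat([apotheca, actavis, mylan], ignore_index=True)
print(combined)
```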
- Double-click each of the three data assets one by one, click the Output tab, and then click Edit column to verify the data type, length, and nullability of all columns
- Search for the Remove Duplicates stage and drag and drop it onto the canvas.
- Double-click the Remove Duplicates stage to select the deduplication criteria
- Search for 'sort' and drag and drop the Sort stage onto the canvas. Double-click the Sort stage and set the sort criteria to the id column.
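Together, the Remove Duplicates and Sort stages deduplicate the funneled rows on a key column and order the result. A rough pandas equivalent, assuming id is the key column, looks like this:

```python
# Conceptual equivalent of the Remove Duplicates and Sort stages.
# `combined` stands in for the funneled rows; `id` is assumed to be the key column.
import pandas as pd

combined = pd.DataFrame({"id": [3, 1, 1, 2], "name": ["C", "A", "A", "B"]})

# Drop duplicate keys, then order the remaining rows by id.
deduplicated = combined.drop_duplicates(subset=["id"], keep="first")
sorted_rows = deduplicated.sort_values(by="id")
print(sorted_rows)
```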
- Search for 'rds' and then drag and drop the Amazon RDS for PostgreSQL connector onto the canvas.
- Double-click the RDS connector to specify the data source and target table name
- Compile the DataStage pipeline and, if compilation succeeds, run it.
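For reference, the work done by the Amazon RDS for PostgreSQL target connector is conceptually similar to writing a DataFrame into a table. The sketch below uses pandas and SQLAlchemy; the connection string is a placeholder, and the table name follows the example used later in this lab.

```python
# Conceptual equivalent of the RDS for PostgreSQL target connector:
# write the deduplicated, sorted rows into a table.
# The connection string is a placeholder; the table name matches the lab example.
import pandas as pd
from sqlalchemy import create_engine

sorted_rows = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})

engine = create_engine(
    "postgresql+psycopg2://your_user:your_password"
    "@your-instance.xxxxxxxx.us-east-1.rds.amazonaws.com:5432/postgres"
)
sorted_rows.to_sql(
    "healthcare_personnel_integrated_data_table_v1",
    engine,
    if_exists="replace",
    index=False,
)
```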
- The integrated data is now available in Amazon RDS for PostgreSQL. Let's ingest it back into the project from the data source (a direct SQL check of the target table is sketched after these steps).
Discover and select the integrated table (e.g. healthcare_personnel_integrated_data_table_v1)
Specify a name for the asset
Verify data asset is present in the project
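If you want to double-check the DataStage run outside Cloud Pak for Data, you can query the target table directly. The connection details below are placeholders matching the earlier connectivity check; the table name is the one used in this lab.

```python
# Optional: confirm the integrated table landed in RDS for PostgreSQL.
# Host, port, database, and credentials are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="your-instance.xxxxxxxx.us-east-1.rds.amazonaws.com",
    port=5432,
    dbname="postgres",
    user="your_user",
    password="your_password",
)
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM healthcare_personnel_integrated_data_table_v1;")
    print("row count:", cur.fetchone()[0])
conn.close()
```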