- Log in to IBM Cloud Pak for Data with valid credentials
To perform this lab, you need IBM Cloud Pak for Data credentials, which include both a username and a password.
Click the Navigation Menu, expand Projects, and click All Projects. Then click ‘New Project +’ to create a new project by following the steps shown in the images below.
Click Create an empty project
Give the project a name (e.g. Data_Fabric_Project)
Once the project is created, you will be redirected to the project page, as shown below
- Select Add to project + and choose Connection as the asset type
- Choose Amazon S3 as the connection type.
- Provide the Amazon S3 connection details to establish a connection between Amazon S3 and IBM Cloud Pak for Data.
- Click Test connection to validate the connection. If it is successful, click Create to create the S3 connection.
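If the test fails, it can help to verify the same credentials and bucket outside Cloud Pak for Data. Below is a minimal sketch using boto3; the access key, secret key, region, and bucket name are placeholders for your own values.

```python
# Sanity-check the S3 credentials and bucket outside Cloud Pak for Data.
# The access key, secret key, region, and bucket name below are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
    region_name="us-east-1",
)

# List a few objects to confirm the credentials can read the bucket.
response = s3.list_objects_v2(Bucket="your-healthcare-data-bucket", MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"])
```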
- Similarly, repeat the same steps to create connections for the asset types Amazon Redshift and Amazon RDS for PostgreSQL (a quick connectivity check for both is sketched after these steps).
Select Amazon Redshift connector
Give the connection a name (e.g. Amazon Redshift Connection)
Again, go to your project page, click Add to project +, and choose Connection
Select Amazon RDS for PostgreSQL connector
Provide connection details
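Before testing the connections in Cloud Pak for Data, you can optionally verify that the Redshift cluster and the RDS for PostgreSQL instance are reachable with the same credentials. The sketch below uses psycopg2 (Redshift speaks the PostgreSQL wire protocol); all hostnames, ports, database names, and credentials are placeholders.

```python
# Quick connectivity check for the Redshift and RDS for PostgreSQL sources.
# Hostnames, ports, database names, and credentials are placeholders.
import psycopg2

targets = {
    "redshift": {
        "host": "your-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
        "port": 5439,
        "dbname": "dev",
    },
    "rds_postgresql": {
        "host": "your-instance.xxxxxxxx.us-east-1.rds.amazonaws.com",
        "port": 5432,
        "dbname": "postgres",
    },
}

for name, cfg in targets.items():
    # Redshift uses the PostgreSQL wire protocol, so psycopg2 works for both.
    conn = psycopg2.connect(user="your_user", password="your_password", **cfg)
    with conn.cursor() as cur:
        cur.execute("SELECT 1;")
        print(name, "connection OK:", cur.fetchone())
    conn.close()
```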
- Now that the data source connections are created, let's ingest data from the connected data sources. Click Add to project + and then click Connected data
Click Select source
Select apotheca_healthcare_personnel_data.csv
Give the asset a unique name
On the Data_Fabric_Project page, you will see the recently added asset
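Optionally, you can preview the CSV directly from S3 to confirm it is the file you expect before building the pipeline. This sketch assumes your AWS credentials are already configured in the environment; the bucket name and object key are placeholders.

```python
# Optional: preview the CSV directly from S3 before building the pipeline.
# The bucket name and object key are placeholders.
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")  # assumes AWS credentials are configured in the environment
obj = s3.get_object(
    Bucket="your-healthcare-data-bucket",
    Key="apotheca_healthcare_personnel_data.csv",
)
df = pd.read_csv(io.BytesIO(obj["Body"].read()))
print(df.head())
print(df.dtypes)
```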
- Similarly, collect data from the Redshift data source
Discover and select actavis_pharma_healthcare_personnel_table
Specify a unique name
Verify asset has been added to the project
- Similarly, ingest data from the Amazon Aurora PostgreSQL database.
Discover and select mylan_specialty_personnel_data_table
Specify a unique name
Verify asset has been added to the project
- To create the integration pipeline, click Add to project + and then DataStage flow.
- Enter a DataStage flow name and click Create to create a new DataStage flow.
- On the DataStage home page you will see three options: Connectors, to read from or write to data sources; Stages, to perform ETL operations; and Quality.
- Click Connectors to expand it, then drag and drop the Asset Browser onto the DataStage canvas.
- You need to select data assets to create the integration pipeline. Select Data asset, select the three data assets that were ingested in the previous steps, and then click Add.
- You should be able to see all three data assets as shown below.
- Now search for 'funnel' in the search box and drag and drop the Funnel stage onto the canvas.
- Create links from all the data assets to the Funnel stage, as shown in the picture below.
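The Funnel stage combines the rows arriving on its input links into a single output link, essentially a row-wise union. The pandas sketch below illustrates the idea with small placeholder DataFrames standing in for the three data assets; it is not part of the DataStage flow itself.

```python
# Conceptual equivalent of the Funnel stage: union rows from the three sources.
# The placeholder DataFrames stand in for the three data assets on the canvas.
import pandas as pd

apotheca = pd.DataFrame({"id": [1, 2], "name": ["A", "B"]})
actavis = pd.DataFrame({"id": [3], "name": ["C"]})
mylan = pd.DataFrame({"id": [4], "name": ["D"]})

# Funnel appends the rows from every input link onto one output link.
combined = pd.concat([apotheca, actavis, mylan], ignore_index=True)
print(combined)
```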
- Double-click each of the three data assets one by one, click the Output tab, and then click Edit column to verify the data type, length, and nullability of all columns
- Search for the Remove Duplicates stage and drag and drop it onto the canvas.
- Double-click the Remove Duplicates stage to select the deduplication criteria
- Search for 'sort' and drag and drop the Sort stage onto the canvas. Double-click the Sort stage and set the sort criteria to the id column.
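Together, the Remove Duplicates and Sort stages deduplicate the funneled rows on a key column and order the result. A rough pandas equivalent, assuming id is the key column, looks like this:

```python
# Conceptual equivalent of the Remove Duplicates and Sort stages.
# `combined` stands in for the funneled rows; `id` is assumed to be the key column.
import pandas as pd

combined = pd.DataFrame({"id": [3, 1, 1, 2], "name": ["C", "A", "A", "B"]})

# Drop duplicate keys, then order the remaining rows by id.
deduplicated = combined.drop_duplicates(subset=["id"], keep="first")
sorted_rows = deduplicated.sort_values(by="id")
print(sorted_rows)
```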
- Search for 'rds' and then drag and drop the Amazon RDS for PostgreSQL connector onto the canvas.
- Double-click the RDS connector to specify the data source and target table name
- Compile the DataStage pipeline and, if compilation succeeds, run it.
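For reference, the work done by the Amazon RDS for PostgreSQL target connector is conceptually similar to writing a DataFrame into a table. The sketch below uses pandas and SQLAlchemy; the connection string is a placeholder, and the table name follows the example used later in this lab.

```python
# Conceptual equivalent of the RDS for PostgreSQL target connector:
# write the deduplicated, sorted rows into a table.
# The connection string is a placeholder; the table name matches the lab example.
import pandas as pd
from sqlalchemy import create_engine

sorted_rows = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})

engine = create_engine(
    "postgresql+psycopg2://your_user:your_password"
    "@your-instance.xxxxxxxx.us-east-1.rds.amazonaws.com:5432/postgres"
)
sorted_rows.to_sql(
    "healthcare_personnel_integrated_data_table_v1",
    engine,
    if_exists="replace",
    index=False,
)
```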
- The integrated data is now available in Amazon RDS for PostgreSQL. Let's ingest it back into the project from the data source (a direct SQL check of the target table is sketched after these steps).
Discover and select the integrated table (e.g. healthcare_personnel_integrated_data_table_v1)
Specify a name for the asset
Verify data asset is present in the project
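If you want to double-check the DataStage run outside Cloud Pak for Data, you can query the target table directly. The connection details below are placeholders matching the earlier connectivity check; the table name is the one used in this lab.

```python
# Optional: confirm the integrated table landed in RDS for PostgreSQL.
# Host, port, database, and credentials are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="your-instance.xxxxxxxx.us-east-1.rds.amazonaws.com",
    port=5432,
    dbname="postgres",
    user="your_user",
    password="your_password",
)
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM healthcare_personnel_integrated_data_table_v1;")
    print("row count:", cur.fetchone()[0])
conn.close()
```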