In this exercise you will combine product and review data into a joined data set that can be used for analytics.
- You will learn how to explore the data sets with the Meta-Data Explorer
- You will take your first steps with the Pipeline Modeler to join two data sets and write the result as CSV to the S3 blob store.
- In the Factsheet view choose the Data Preview Tab to see a sample of the actual data.
The review data set has a PRODUCT_ID column that refers to the actual product in our master data. The review text is available in the REVIEW_TEXT column.
- In the Data Preview tab you see a sample of the table. You recognize the PRODUCTID column, which we can use to join the product details with the reviews.
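  (Optional) If you want to sanity-check the join key outside of SAP DI, a quick look with pandas shows whether the PRODUCT_ID values in the reviews actually match the PRODUCTID values of the product master data. This is only an offline sketch; the file names below are placeholders for local exports of the two data sets.

  ```python
  import pandas as pd

  # Placeholder file names for local exports of the two data sets.
  products = pd.read_csv("products.csv")        # contains PRODUCTID
  reviews = pd.read_csv("product_reviews.csv")  # contains PRODUCT_ID and REVIEW_TEXT

  # How many reviews reference a product that exists in the master data?
  matched = reviews["PRODUCT_ID"].isin(products["PRODUCTID"])
  print(f"{matched.sum()} of {len(reviews)} reviews have a matching product")
  ```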
- The resource panel on the left will have the vertical graph tab selected. Here, you see a couple of example graphs shipped with SAP DI. We will later also find our own graphs in this resource panel. Click the + icon to create a new graph...
- Fill in a unique graph name and a description similar to the screenshot (e.g., data_preparation).
- Now, select the Operator tab and search for file consumer. The File Consumer operator in the Structured Data Operators category will be shown. Select the operator and drag it into the graph.
- By selecting the operator you will see that the Configuration Panel on the right switches to the operator-specific parameters. For Storage Type choose S3, since we want to consume the review file from an S3 connection.
- Now, click on the path parameter to choose the file. You will see a file browser with a list of files. Choose the file Product_Reviews.
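  For orientation, what the File Consumer does here roughly corresponds to reading the CSV file straight from S3. A minimal pandas sketch, assuming a hypothetical bucket name, the s3fs package, and S3 credentials configured in the environment; in SAP DI the connection and path are set on the operator instead of in code.

  ```python
  import pandas as pd

  # Hypothetical bucket and key; requires the s3fs package and S3 credentials.
  reviews = pd.read_csv("s3://my-demo-bucket/Product_Reviews.csv")
  print(reviews.head())
  ```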
- After confirming the file, you can click the Data Preview link to show a sample of the data
similar to the Factsheet in the Meta-data Explorer...
- Select the Table Consumer operator under the Structured Data Operators category and drag it into the graph.
- Connect the Table Consumer output with the Data Transform operator.
You will see a port created in the Data Transform operator and a link connecting
the operators.
- Connect the upper input with the join operator. Note that the input name equals the input port name of the Data Transform operator and is created automatically. You can also create the port manually (shown in a later exercise).
- Connect the lower input with the join operator. Then, double-click on the Join operator
to open the parametrization.
- Click on the PRODUCT_ID column of the upper data set and drag a link to the PRODUCTID column of the lower data set.
- The join condition will be created and shown in the bottom panel. Change the join type to Left Outer so that products without any reviews also appear in the resulting data set.
- Click the columns you want to have in the resulting data set. For this exercise, click the columns marked by the blue circles in the screenshot. The order in which you click the columns defines the column order in the resulting table. Here, please click the columns from top to bottom.
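  Conceptually, the Data Transform operator now performs a left outer join of the two inputs followed by a projection to the selected columns. A rough pandas equivalent of that logic is sketched below; every column name other than PRODUCTID, PRODUCT_ID and REVIEW_TEXT is a placeholder, and the file names are stand-ins for the two inputs.

  ```python
  import pandas as pd

  products = pd.read_csv("products.csv")        # product master data (PRODUCTID)
  reviews = pd.read_csv("product_reviews.csv")  # reviews (PRODUCT_ID, REVIEW_TEXT)

  # Left outer join: every product is kept, even those without any review.
  joined = products.merge(
      reviews,
      how="left",
      left_on="PRODUCTID",
      right_on="PRODUCT_ID",
  )

  # Projection: the order of the selected columns defines the output order.
  # PRODUCTNAME is a hypothetical example column from the product data.
  result = joined[["PRODUCTID", "PRODUCTNAME", "REVIEW_TEXT"]]
  ```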
- In case the formatting looks off, click the Format button in the top panel (as shown in screenshot).
- Execute the graph by clicking the Run icon in the top panel. In the status panel the pipeline will appear showing the status pending. This shows that the graph has been scheduled on the Kubernetes environment.
- Once the graph is running, you can click the Wiretap and choose the top icon to
open the debug view.
- You will see that a single file has been produced by the File Producer. Note that the graph keeps running until you explicitly stop it by clicking the Stop icon in the top panel.
- If we want the graph to stop automatically once the data has been consumed, we can add a Graph Terminator operator. Search for the operator, drag it into the graph, and connect it to the output of the File Producer.
- You can now check for the produced file using the Meta-data Explorer. For this, simply go to the browsing section again and look for the filename you provided in the File Producer.
- In the Data Preview tab you will see the joined data. Most likely, however, the
column names will not be shown.
- To have the column names added to the CSV file, we need to go back to the graph and open the configuration panel of the File Producer.
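  Whether or not the header row is written matters as soon as the file is read again. A small pandas illustration, using a hypothetical output file name: if the CSV has no header row, the column names have to be supplied explicitly, otherwise the first data row would be misinterpreted as the header.

  ```python
  import pandas as pd

  path = "joined_reviews.csv"  # placeholder for the file written by the File Producer

  # Without a header row in the file, name the columns yourself; the names and
  # their order must match the projection configured in the Data Transform.
  columns = ["PRODUCTID", "PRODUCTNAME", "REVIEW_TEXT"]  # PRODUCTNAME is a placeholder
  no_header = pd.read_csv(path, header=None, names=columns)

  # Once the File Producer is configured to write the header, this is enough:
  # with_header = pd.read_csv(path)
  ```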
Congratulations. You've created a re-usable pipeline that joins the review data with the product data and stores the result for further analysis.
Continue to Exercise 2