Module 02A - Incremental Copy to Raw (using High Watermark)

⏱️ Estimated Duration

20 minutes

🤔 Prerequisites

  • Lab environment deployed
  • Module 1A (Linked Service, Integration Datasets)

📢 Introduction

In this module, we will set up a Synapse Pipeline that incrementally copies customer order data from an OLTP source (Azure SQL Database) to the raw layer of a data lake (Azure Data Lake Storage Gen2), using a high watermark value to isolate the rows that changed since the last load.

flowchart LR
ds1[(Azure SQL DB\nHigh Watermark)]
ds2[(Data Lake\nraw)]
ds1-.->a1
ds1-.->a2
ds1-.->a3
ds1-."source\ndbo.Orders".->a5
a5-."sink\n01-raw/wwi/orders/$fileName.csv".->ds2
subgraph p["Pipeline (O1 - pipelineIncrementalCopyWatermark)"]
a1[Lookup\ngetOldWatermark]
a2[Lookup\ngetNewWatermark]
a3[Lookup\ngetChangeCount]
sg[If Condition\nhasChangedRows]
a1-->a3
a2-->a3
a3-->sg
    subgraph sg[If Condition\nhasChangedRows]
    a5[Copy data\nincrementalCopy]
    a6[Stored procedure\nupdateWatermark]
    a5-->a6
    end
end

🎯 Objectives

  • Prepare source system to store and update a watermark value
  • Create a Pipeline
  • Copy data changes to the data lake

Table of Contents

  1. Source Environment (dbo.Orders)
  2. Pipeline (Lookup - getOldWatermark)
  3. Pipeline (Lookup - getNewWatermark)
  4. Pipeline (Lookup - getChangeCount)
  5. Pipeline (If Condition)
  6. Pipeline (Copy data)
  7. Pipeline (Stored procedure)

1. Source Environment (dbo.Orders)

Initialize the source environment:

  • Create a table dbo.Orders and populate the table with some data
  • Create a SQL trigger that will automatically update the LastModifiedDateTime column on dbo.Orders when an UPDATE occurs
  • Create a watermark table dbo.Watermark to track the maximum LastModifiedDateTime from the last successful load
  • Create a stored procedure to update the watermark table after a successful load
  1. Navigate to the SQL database

  2. Click Query editor

  3. Copy and paste your Login and Password from the code snippets below

    Login

    sqladmin
    

    Password

    sqlPassword!
    

  4. To create the source table, copy and paste the code snippet below and click Run

    CREATE TABLE Orders (
        OrderID int IDENTITY(1,1) PRIMARY KEY,
        CustomerID int FOREIGN KEY REFERENCES Customers(CustomerID),
        Quantity int NOT NULL,
        OrderDateTime DATETIME default CURRENT_TIMESTAMP,
        LastModifiedDateTime DATETIME default CURRENT_TIMESTAMP
    );
    INSERT INTO dbo.Orders (CustomerID, Quantity)
    VALUES
        (1,38),
        (2,27),
        (3,16),
        (1,52);

  5. To create a SQL trigger that will automatically update the LastModifiedDateTime column on UPDATE, copy and paste the code snippet below and click Run

    CREATE TRIGGER trg_orders_update_modified
    ON dbo.Orders
    AFTER UPDATE 
    AS
        UPDATE dbo.Orders
        SET LastModifiedDateTime = CURRENT_TIMESTAMP
        FROM Inserted i
        WHERE dbo.Orders.OrderID = i.OrderID;

  6. To initialize the watermark table, copy and paste the code snippet below and click Run

    CREATE TABLE Watermark (
        TableName varchar(255),
        Watermark DATETIME
    );
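    -- ISO 8601 literal so the initial watermark parses the same way under any language setting
    INSERT INTO dbo.Watermark (TableName, Watermark)
    VALUES
    ('dbo.Orders', '2022-01-01T00:00:00');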

  7. To allow the watermark value to be updated programmatically through a stored procedure, copy and paste the code snippet below and click Run

    -- @TableName uses varchar(255) to match the width of the Watermark.TableName column
    CREATE PROCEDURE sp_update_watermark @LastModifiedDateTime datetime, @TableName varchar(255)
    AS
        UPDATE Watermark
        SET [Watermark] = @LastModifiedDateTime
        WHERE [TableName] = @TableName;
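
The source-side objects created above can be tried out away from Azure. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Azure SQL Database; it is not the lab's T-SQL. The `Clock` table, the `create_source` and `update_watermark` helpers, and the sample timestamps are assumptions made for illustration (`Clock` replaces CURRENT_TIMESTAMP so the result is repeatable).

```python
import sqlite3


def create_source(conn: sqlite3.Connection) -> None:
    """Create SQLite stand-ins for dbo.Orders, its trigger, and dbo.Watermark."""
    conn.executescript(
        """
        -- Hypothetical one-row table standing in for CURRENT_TIMESTAMP.
        CREATE TABLE Clock (Now TEXT NOT NULL);
        INSERT INTO Clock VALUES ('2022-06-01 09:00:00');

        CREATE TABLE Orders (
            OrderID INTEGER PRIMARY KEY,
            Quantity INTEGER NOT NULL,
            LastModifiedDateTime TEXT NOT NULL
        );

        -- Plays the role of trg_orders_update_modified: restamp the updated row.
        CREATE TRIGGER trg_orders_update_modified
        AFTER UPDATE OF Quantity ON Orders
        BEGIN
            UPDATE Orders
            SET LastModifiedDateTime = (SELECT Now FROM Clock)
            WHERE OrderID = NEW.OrderID;
        END;

        CREATE TABLE Watermark (TableName TEXT PRIMARY KEY, Watermark TEXT NOT NULL);
        INSERT INTO Watermark VALUES ('dbo.Orders', '2022-01-01 00:00:00');
        """
    )


def update_watermark(conn: sqlite3.Connection, last_modified: str, table_name: str) -> None:
    """Plays the role of sp_update_watermark."""
    conn.execute(
        "UPDATE Watermark SET Watermark = ? WHERE TableName = ?",
        (last_modified, table_name),
    )


conn = sqlite3.connect(":memory:")
create_source(conn)
conn.execute("INSERT INTO Orders VALUES (1, 38, '2022-05-01 08:00:00')")
conn.execute("INSERT INTO Orders VALUES (2, 27, '2022-05-01 08:00:00')")

# Updating one order fires the trigger, which restamps only that row.
conn.execute("UPDATE Orders SET Quantity = 40 WHERE OrderID = 1")
stamps = dict(conn.execute("SELECT OrderID, LastModifiedDateTime FROM Orders"))

# After a successful load, the watermark moves to the latest modification time.
update_watermark(conn, max(stamps.values()), "dbo.Orders")
watermark = conn.execute(
    "SELECT Watermark FROM Watermark WHERE TableName = 'dbo.Orders'"
).fetchone()[0]
```

Only order 1 receives the new timestamp, which is what lets a later load tell changed rows from unchanged ones.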

2. Pipeline (Lookup - getOldWatermark)

In this step, we will create a pipeline, O1 - pipelineIncrementalCopyWatermark, to incrementally copy order data from Azure SQL Database to Azure Data Lake Storage Gen2. The first activity in the pipeline is a Lookup that queries the dbo.Watermark table to retrieve the current watermark value.

  1. Navigate to the Synapse workspace

  2. Open Synapse Studio

  3. Navigate to the Integrate hub

  4. On the right-hand side of Pipelines, click the [...] ellipsis icon and select New folder

  5. Copy and paste the Folder name from the snippet below and click Create

    Orders
    

  6. On the right-hand side of the Orders folder, click the [...] ellipsis icon and select New pipeline

  7. Rename the pipeline to O1 - pipelineIncrementalCopyWatermark

  8. Within Activities, search for Lookup, and drag the Lookup activity onto the canvas

  9. Rename the activity getOldWatermark

  10. Switch to the Settings tab and set the Source dataset to AzureSqlTable

  11. Set the Dataset property schema to dbo

  12. Set the Dataset property table to Watermark

  13. Set the Use query property to Query, click inside the Query text, and copy and paste the code snippet

    SELECT * FROM Watermark WHERE TableName = 'dbo.Orders'

  14. Click Preview data to confirm the query is valid

3. Pipeline (Lookup - getNewWatermark)

In this step, we will add a second Lookup activity to calculate a new watermark value based on the MAX LastModifiedDateTime from the dbo.Orders table.

  1. Within Activities, search for Lookup, and drag the Lookup activity onto the canvas

  2. Rename the activity getNewWatermark

  3. Switch to the Settings tab and set the Source dataset to AzureSqlTable

  4. Set the Dataset property schema to dbo

  5. Set the Dataset property table to Orders

  6. Set the Use query property to Query, click inside the Query text, and copy and paste the code snippet

    SELECT MAX(LastModifiedDateTime) as NewWatermarkValue FROM dbo.Orders

  7. Click Preview data to confirm the query is valid

4. Pipeline (Lookup - getChangeCount)

In this step, we will add a third Lookup activity to count the records (changecount) whose LastModifiedDateTime falls between the old and new watermark values.

  1. Within Activities, search for Lookup, and drag the Lookup activity onto the canvas

  2. Rename the activity getChangeCount

  3. Click and drag on the green button from each Lookup activity (getOldWatermark and getNewWatermark) to establish a connection to the new Lookup activity (getChangeCount)

  4. Switch to the Settings tab and set the Source dataset to AzureSqlTable

  5. Set the Dataset property schema to dbo

  6. Set the Dataset property table to Orders

  7. Set the Use query property to Query, click inside the Query text, and copy and paste the code snippet

    SELECT COUNT(*) as changecount FROM dbo.Orders WHERE LastModifiedDateTime > '@{activity('getOldWatermark').output.firstRow.Watermark}' and LastModifiedDateTime <= '@{activity('getNewWatermark').output.firstRow.NewWatermarkValue}'

  8. Click Debug

  9. Once the pipeline has finished running, under Output, hover your mouse over the getChangeCount activity and click the Output icon. You should see a changecount property with a value of 4.
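
The getChangeCount query selects rows in the window (old watermark, new watermark]: strictly newer than the last load, and no newer than the value about to be stored. The sketch below reproduces that comparison in plain Python to show why the first debug run reports 4 and a repeat run would report 0. The `count_changes` helper and the sample timestamps are hypothetical; only the comparison operators come from the query above.

```python
from datetime import datetime


def count_changes(modified_times, old_watermark, new_watermark):
    """Count rows with old_watermark < LastModifiedDateTime <= new_watermark."""
    return sum(1 for t in modified_times if old_watermark < t <= new_watermark)


# Hypothetical modification times for the four seeded orders.
rows = [
    datetime(2022, 3, 1, 10),
    datetime(2022, 3, 1, 10),
    datetime(2022, 3, 2, 9),
    datetime(2022, 3, 5, 14),
]
old = datetime(2022, 1, 1)  # initial value stored in dbo.Watermark
new = max(rows)             # what getNewWatermark would return

# First run: every row is newer than the initial watermark.
first_run = count_changes(rows, old, new)

# Second run: the watermark has advanced to `new`, so nothing is left to copy.
second_run = count_changes(rows, new, max(rows))

# A row modified later falls into the next window only.
rows.append(datetime(2022, 3, 6, 8))
third_run = count_changes(rows, new, max(rows))
```

Using `>` on the lower bound and `<=` on the upper bound means a row stamped exactly at the stored watermark is not copied twice, while the newest row is still included.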

5. Pipeline (If Condition)

In this step, we will add an If Condition that will be satisfied if the change count is greater than zero.

  1. Within Activities, search for If, and drag the If condition activity onto the canvas

  2. Rename the activity hasChangedRows

  3. Click and drag on the green button on the previous Lookup activity (getChangeCount) to establish a connection to the If Condition activity

  4. Switch to the Activities tab, click inside the Expression text input, and click Add dynamic content

  5. Copy and paste the code snippet and click OK

    @greater(int(activity('getChangeCount').output.firstRow.changecount),0)

  6. Within the True case, click the pencil icon

6. Pipeline (Copy data)

In this step, we are going to add a Copy data activity within the If Condition that will copy the new order data from the Azure SQL Database to the Azure Data Lake Storage Gen2 account.

  1. Within Activities, search for Copy, and drag the Copy data activity onto the canvas

  2. Rename the activity incrementalCopy

  3. Switch to the Source tab and set the Source dataset to AzureSqlTable

  4. Under Dataset properties, set the schema to dbo

  5. Under Dataset properties, set the table to Orders

  6. Set Use query to Query, click inside the Query text input, and click Add dynamic content

  7. Copy and paste the code snippet and click OK

    SELECT * FROM dbo.Orders WHERE LastModifiedDateTime > '@{activity('getOldWatermark').output.firstRow.Watermark}' and LastModifiedDateTime <= '@{activity('getNewWatermark').output.firstRow.NewWatermarkValue}'

  8. Switch to the Sink tab and set the Sink dataset to AdlsRawDelimitedText

  9. Under Dataset properties, set the folderPath to wwi/orders

  10. Under Dataset properties, click inside the fileName text input and click Add dynamic content

  11. Copy and paste the code snippet and click OK

    @concat(formatDateTime(pipeline().TriggerTime,'yyyyMMddHHmmssfff'),'.csv')
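
To see what kind of file name the expression above produces, here is a rough Python equivalent of the `yyyyMMddHHmmssfff` pattern (year, month, day, 24-hour time, then milliseconds). The `raw_file_name` helper and the sample trigger times are hypothetical, and the UTC time zone is an assumption made for the example.

```python
from datetime import datetime, timezone


def raw_file_name(trigger_time: datetime) -> str:
    """Format a timestamp as yyyyMMddHHmmssfff and append the .csv extension."""
    millis = trigger_time.microsecond // 1000  # fff = three-digit milliseconds
    return f"{trigger_time:%Y%m%d%H%M%S}{millis:03d}.csv"


earlier = raw_file_name(datetime(2022, 6, 1, 9, 30, 15, 123000, tzinfo=timezone.utc))
later = raw_file_name(datetime(2022, 6, 1, 14, 5, 0, 7000, tzinfo=timezone.utc))
```

Because every field is fixed-width and ordered from most to least significant, the names sort chronologically, so the newest file in 01-raw/wwi/orders is also the last one alphabetically.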

7. Pipeline (Stored procedure)

In this step, we are going to add a Stored procedure activity that will update the watermark table with the new watermark value.

  1. Within Activities, search for Stored, and drag the Stored procedure activity onto the canvas

  2. Rename the activity updateWatermark

  3. Click and drag on the green button from the Copy data activity to establish a connection to the Stored procedure activity

  4. Switch to the Settings tab and set the Linked service to AzureSqlDatabase

  5. Set the Stored procedure name to [dbo].[sp_update_watermark]

  6. Under Stored procedure parameters, click Import

  7. Click inside the LastModifiedDateTime value text input and click Add dynamic content

  8. Copy and paste the code snippet and click OK

    @{activity('getNewWatermark').output.firstRow.NewWatermarkValue}

  9. Click inside the TableName value text input and click Add dynamic content

  10. Copy and paste the code snippet and click OK

    @{activity('getOldWatermark').output.firstRow.TableName}

  11. Click Publish all

  12. Click Publish

  13. Navigate back to the pipeline and click Debug

  14. Periodically click Refresh until all the activities within the pipeline have succeeded

  15. Navigate to the Data hub, browse the data lake folder structure under the Linked tab to 01-raw/wwi/orders, right-click the newest CSV file and select New SQL Script > Select TOP 100 rows

  16. Modify the SQL statement to include HEADER_ROW = TRUE within the OPENROWSET function and click Run

🎉 Summary

You have successfully set up a pipeline that checks for changes in the source system by referencing the last high watermark and copies those changes to the raw layer of your data lake.
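
For readers who want to reason about the control flow without opening Synapse Studio, the sketch below walks through one pass of the pipeline in Python, with sqlite3 standing in for Azure SQL Database and a returned list standing in for the CSV written to the data lake. The `run_pipeline` function, the simplified table definitions, and the sample timestamps are assumptions for illustration; the comments map each statement to the activity it imitates.

```python
import sqlite3


def run_pipeline(conn: sqlite3.Connection, table_name: str = "dbo.Orders") -> list:
    """Run one pass of the watermark pattern and return the rows that would be copied."""
    # getOldWatermark
    old = conn.execute(
        "SELECT Watermark FROM Watermark WHERE TableName = ?", (table_name,)
    ).fetchone()[0]
    # getNewWatermark
    new = conn.execute("SELECT MAX(LastModifiedDateTime) FROM Orders").fetchone()[0]
    # getChangeCount
    window = "LastModifiedDateTime > ? AND LastModifiedDateTime <= ?"
    count = conn.execute(
        f"SELECT COUNT(*) FROM Orders WHERE {window}", (old, new)
    ).fetchone()[0]
    # hasChangedRows: skip the copy and the watermark update when nothing changed
    if count == 0:
        return []
    # incrementalCopy
    rows = conn.execute(
        f"SELECT OrderID, Quantity FROM Orders WHERE {window} ORDER BY OrderID",
        (old, new),
    ).fetchall()
    # updateWatermark
    conn.execute(
        "UPDATE Watermark SET Watermark = ? WHERE TableName = ?", (new, table_name)
    )
    return rows


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, Quantity INTEGER, LastModifiedDateTime TEXT);
    CREATE TABLE Watermark (TableName TEXT, Watermark TEXT);
    INSERT INTO Watermark VALUES ('dbo.Orders', '2022-01-01 00:00:00');
    INSERT INTO Orders VALUES
        (1, 38, '2022-06-01 09:00:00'),
        (2, 27, '2022-06-01 09:00:00'),
        (3, 16, '2022-06-01 09:00:00'),
        (4, 52, '2022-06-01 09:00:00');
    """
)

first = run_pipeline(conn)   # initial load copies all four orders
second = run_pipeline(conn)  # nothing changed, so nothing is copied

# Simulate an UPDATE on the source (the trigger would restamp the row).
conn.execute(
    "UPDATE Orders SET Quantity = 40, LastModifiedDateTime = '2022-06-02 10:00:00' "
    "WHERE OrderID = 1"
)
third = run_pipeline(conn)   # only the modified order is copied
```

The watermark is updated only after the copy step, mirroring the dependency between incrementalCopy and updateWatermark, so a failed copy leaves the old watermark in place and the same rows are picked up on the next run.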

✅ Results

Azure SQL Database

  • CREATE TABLE Orders
  • INSERT INTO dbo.Orders
  • CREATE TRIGGER trg_orders_update_modified
  • CREATE TABLE Watermark
  • INSERT INTO dbo.Watermark
  • CREATE PROCEDURE sp_update_watermark

Azure Synapse Analytics

  • 1 x Pipeline (O1 - pipelineIncrementalCopyWatermark)

Azure Data Lake Storage Gen2

  • 1 x CSV file (01-raw/wwi/orders)
