This tutorial describes how to use Azure Data Factory to transform a single file and load it into Azure SQL Database, following a common data pipeline architecture. The pipeline that you create in this tutorial un-pivots global CO2 emission data, stores it in an Azure storage account and ingests it into Azure SQL Database. This tutorial covers the following steps in Azure Data Factory:
- Create a new linked service.
- Create a new dataset.
- Create a data flow to transform the dataset.
- Create a pipeline to run the data flow.
- Run the pipeline.
- Monitor the pipeline run.
- Review the results.
This tutorial requires the following resources:
- Data Factory
- Two storage accounts (a container-creation sketch follows this list)
  - The first storage account has a `data` container containing the `API_EN.ATM.CO2E.KT_DS2_en_csv_v2_3358949.csv` file.
  - The second storage account has a `conformed` container. Ensure **Hierarchical namespace** is enabled.
- Azure SQL Database (Serverless)
- (Optional) Azure DevOps repository
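If you prefer to script the storage-side prerequisites, the sketch below creates the two containers with the `azure-storage-blob` package. The connection strings are placeholders for your own accounts, and note that **Hierarchical namespace** cannot be switched on here; it has to be enabled when the second storage account is created.

```python
from azure.storage.blob import BlobServiceClient

# Placeholder connection strings for the two storage accounts used in this tutorial.
raw_account = BlobServiceClient.from_connection_string("<raw-storage-connection-string>")
conformed_account = BlobServiceClient.from_connection_string("<conformed-storage-connection-string>")

# "data" holds the raw World Bank CSV; "conformed" receives the unpivoted output.
raw_account.create_container("data")
conformed_account.create_container("conformed")
```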
Suppose that there is a raw file (global CO2 emission) which has been extracted from a source database and landed in the raw layer. The picture below illustrates a typical architecture of data pipelines; the red rectangle marks the part covered in this section, where the raw file is transformed (i.e., unpivoted) and loaded to a blob storage.
- Create a new Linked service.
  a. Before transforming the raw file, a linked service must be created to read the file. Click **Manage** and then **Linked services** on the left pane. Click **New** to create a new linked service. Optionally, create a new feature branch if the Data Factory is integrated with Git.
  b. Select **Azure Blob Storage** and **Continue**.
  c. Enter a **Name** and select the **storage account name** where the raw file is stored to configure the linked service details as below.
  d. Click **Test connection** to check that the linked service is set up properly, and click **Create**. A programmatic sketch of this linked service follows this step.
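For reference, the same linked service can also be created with the `azure-mgmt-datafactory` SDK. This is only a sketch under assumed names: the subscription, resource group, factory name and connection string are placeholders, and the UI steps above remain the primary path in this tutorial.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobStorageLinkedService,
    LinkedServiceResource,
    SecureString,
)

# Placeholder resource identifiers.
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"

adf = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Linked service pointing at the storage account that holds the raw file.
linked_service = LinkedServiceResource(
    properties=AzureBlobStorageLinkedService(
        connection_string=SecureString(value="<storage-account-connection-string>")
    )
)
adf.linked_services.create_or_update(resource_group, factory_name, "LS_BS_DF1", linked_service)
```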
- Create a new Dataset.
  a. Click **Author** on the left menu, click the **+ (plus)** button and then select **Dataset**.
  b. Select **Azure Blob Storage** and click **Continue**.
  c. Select **DelimitedText** and click **Continue**.
  d. Enter a **Name**, select `LS_BS_DF1` in the **Linked service** and click **Browse** to select the raw file, `API_EN.ATM.CO2E.KT_DS2_en_csv_v2_3358949.csv`. Tick **First row as header**, select **From sample file** and then click **Browse** to select the header file, `headers.csv`. Click **OK**.
  e. (Optional) If you have integrated with Azure DevOps, you can see the `co2` dataset appear in the repository when you save it.
  f. Note that **Preview data** does NOT show the data in the correct form because the actual data starts at the 6th row (see the sketch after this step).
  g. Verify that the headers appear correctly by clicking **Schema**.
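To see why the preview looks wrong, you can peek at the first lines of the raw file locally. In the World Bank export used here, the header sits on line 5 and the data starts on line 6, which is why a skip line count of 4 is configured in the data flow later. A minimal sketch, assuming the file has been downloaded next to the script:

```python
# Print the first six lines of the raw export to confirm where the header
# (line 5) and the data (line 6 onwards) actually start.
with open("API_EN.ATM.CO2E.KT_DS2_en_csv_v2_3358949.csv", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        print(line_number, line.rstrip())
        if line_number == 6:
            break
```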
- Transform Data
  a. Create a data flow to transform the raw file. Click the **+ (plus)** button and then select **Data flow**.
  b. Click **Add Source** to configure the raw file.
  c. Select the `co2` dataset and enter `4` in the **Skip line count**.
  d. Click **Data flow debug** on the top menu and select **Data preview** to check that the raw data appears correctly.
  e. Click the **+** sign next to `source1` and select **Unpivot**.
  f. Add `Country Name`, `Country Code`, `Indicator Name` and `Indicator Code`.
  g. Enter `year` and set `integer` in the **Unpivot key**.
  h. Set the unpivoted column as `co2_emissions` with the `string` data type. Tick **Drop rows with null**.
  i. Click **Data preview** to check that the data appears correctly.
  j. Click the **+** sign next to `Unpivot1` and select **Sink**.
  k. **Sink** defines the output data, the un-pivoted CO2 emissions. First, create a new linked service, `LS_BS_DF2`, like `LS_BS_DF1`, to store conformed files in an Azure storage account (i.e. the Conformed Layer in the pipeline architecture).
  l. Create a new dataset, `co2_unpivoted`, like the `co2` dataset.
  m. Select the `co2_unpivoted` dataset.
  n. Select **Output to single file** and enter the file name `co2_unpivoted.csv`.
  o. Click **Data preview** to check that the data appears correctly.
  p. Enter the data flow name, `unpivot_dataflow`, and click **Save**. A pandas sketch of the same unpivot logic follows this step.
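The data flow is the Data Factory way to do this, but the unpivot itself is easy to express in pandas, which can help when validating the output. The sketch below mirrors the same configuration: skip the 4 metadata lines, keep the four identifier columns, melt the year columns into `year`/`co2_emissions`, and drop null rows. It is an illustration only, run against a local copy of the raw file.

```python
import pandas as pd

# Read the raw export, skipping the 4 metadata rows before the header row.
raw = pd.read_csv("API_EN.ATM.CO2E.KT_DS2_en_csv_v2_3358949.csv", skiprows=4)

# Keep the identifier columns and unpivot every year column into rows,
# mirroring the Unpivot transformation in the data flow.
id_cols = ["Country Name", "Country Code", "Indicator Name", "Indicator Code"]
unpivoted = raw.melt(id_vars=id_cols, var_name="year", value_name="co2_emissions")

# Equivalent of ticking "Drop rows with null" in the data flow.
unpivoted = unpivoted.dropna(subset=["co2_emissions"])

unpivoted.to_csv("co2_unpivoted.csv", index=False)
```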
- Create a pipeline to orchestrate the data flow.
  a. Click the **+ (plus)** button and select **Pipeline**.
  b. Drag and drop a **Data flow** activity onto the right canvas.
  c. Name it `Unpivot` and select `unpivot_dataflow` in the **Data flow** setting.
  d. Click the **Debug** button to run the pipeline and verify that it runs successfully.
  e. Also, check that the `co2_unpivoted.csv` file is created successfully in the conformed blob storage. A sketch of triggering and monitoring the published pipeline programmatically follows this step.
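Debug runs are started from the UI, but once the pipeline is published you can also trigger and monitor it with the `azure-mgmt-datafactory` SDK. The pipeline name `unpivot_pipeline` and the resource identifiers below are assumptions; substitute the names you actually used.

```python
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholder resource identifiers.
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"

adf = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Trigger the published pipeline and poll the run until it finishes.
run = adf.pipelines.create_run(resource_group, factory_name, "unpivot_pipeline")
while True:
    pipeline_run = adf.pipeline_runs.get(resource_group, factory_name, run.run_id)
    print(pipeline_run.status)
    if pipeline_run.status not in ("Queued", "InProgress"):
        break
    time.sleep(15)
```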
Now the raw file (global CO2 emission) is transformed and stored in a format ready to be ingested into a database. The picture below illustrates a typical architecture of data pipelines; the red rectangle marks the part covered in this section, where the transformed file is loaded to a database.
- First, create a linked service to connect to the Azure SQL Database.
  a. Click **Manage** and **New** to create a new linked service.
  b. Click **Azure SQL Database** and **Continue**.
  c. Enter `LS_SQLDW` and the authentication details.
  d. Test the connection. If it fails due to the firewall, add the client IP address to the Azure SQL Database and click **Save**. A sketch of adding the firewall rule programmatically follows this step.
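The portal can add your client IP for you, but the firewall rule can also be scripted. A sketch with the `azure-mgmt-sql` package, where the server name, rule name and IP address are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient
from azure.mgmt.sql.models import FirewallRule

# Placeholder resource identifiers.
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
server_name = "<sql-server-name>"

sql = SqlManagementClient(DefaultAzureCredential(), subscription_id)

# Allow a single client IP through the server firewall; use the address
# reported in the failed connection test.
rule = FirewallRule(start_ip_address="203.0.113.10", end_ip_address="203.0.113.10")
sql.firewall_rules.create_or_update(resource_group, server_name, "adf-tutorial-client", rule)
```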
- Go back to the Data Factory and create a new dataset, `co2_sqldw`, which creates a table, `dev.co2`, in the database.
  a. Click **Author**, select the **+ (plus)** button and then **Dataset**. Select **Azure SQL Database** and click **Continue**.
  b. Enter `co2_sqldw` and select the linked service, `LS_SQLDW`, created above. Set `dev.co2` as the table name. Click **OK** and **Save**.
- Drag and drop a **Copy data** activity onto the pipeline. Name it `Load to DW` and connect from `Unpivot` to `Load to DW`.
- Select `co2_unpivoted` in the **Source dataset**. Click **Preview data** to ensure that the data appears correctly. Note that we select **Wildcard file path** and enter `*` (asterisk) even though we have only one file, `co2_unpivoted.csv`.
- Configure the **Sink**. Select the `co2_sqldw` dataset and **Auto create table**.
- Click **Save** and press the **Debug** button to test the pipeline.
- Monitor the pipeline and check the result in the database by running a query, `select top 1000 * from dev.co2`. A sketch of running the same check from Python follows.
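Outside the portal, the same check can be run from Python with `pyodbc`. The connection details are placeholders; use the server, database and credentials behind the `LS_SQLDW` linked service.

```python
import pyodbc

# Placeholder connection details for the Azure SQL Database.
connection = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<server-name>.database.windows.net;"
    "DATABASE=<database-name>;"
    "UID=<user>;PWD=<password>;"
    "Encrypt=yes"
)

# Verify that the Copy data activity populated the auto-created table.
cursor = connection.cursor()
cursor.execute("SELECT TOP 1000 * FROM dev.co2")
for row in cursor.fetchmany(10):
    print(row)
connection.close()
```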