Note
Timebox: 75 minutes
Back to Agenda | Back to Exercise 1 | Up next Exercise 3
- Task 2.1 Copilot for notebooks
- Task 2.2 Different ways to get data from the lakehouse
- Task 2.3 Side Loading (local upload) and Load to Delta for CSV file
- Task 2.4 Import pre-made notebook
- Task 2.5 Attach the bronze Lakehouse
- Task 2.6 Create a silver lakehouse
- Task 2.7 Follow the Notebook
- Task 2.8 Automation
- Task 2.9 Confirm End Results
- Task 2.10 Create Gold Lakehouse (required for Exercise 4)
In this exercise, we will explore data engineering tasks aimed at transforming raw data into a refined silver layer.
By the end of the workshop, we will have completed the implementation of the medallion architecture:
Note
Fabric's intelligent compute resources are dynamically adjusted based on historical usage, peak demands, and current activity levels. With nearly 600 of us today working simultaneously, primarily within the same region, startup times for Spark compute instances may be longer than usual. Typically, our starter pool initiates new Spark sessions in about 10 seconds. However, due to today's high volume, we may transition to the on-demand pool, resulting in wait times of approximately 2 to 3 minutes for some sessions.
Tip
Take a moment to check your timing and plan how you will complete Exercise 2. If you are concerned about delays and want a thorough walkthrough of each required task, consider skipping Task 2.1 (Copilot for notebooks) and proceeding directly to Task 2.2 Different ways to get data from the lakehouse.
Remember, you can always return to this exercise later.
Click here to expand the Task 2.1 Copilot for notebooks
With Copilot for Data Science and Data Engineering, you can chat with an AI assistant that can help you handle your data analysis and visualization tasks. You can ask the Copilot questions about lakehouse tables, Power BI Datasets, or Pandas/Spark dataframes inside notebooks. Copilot answers in natural language or code snippets. Copilot can also generate data-specific code for you, depending on the task. For example, Copilot for Data Science and Data Engineering can generate code for:
- Chart creation
- Filtering data
- Applying transformations
- Machine learning models
Investigate the `green201501` table within your lakehouse, and seek insights about the dataset. Additionally, inquire how to compute the average trip distance and fare amount by each payment type.
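If you want a reference point for the averages question, below is a rough sketch of the kind of aggregation Copilot might propose. The column names `payment_type`, `trip_distance`, and `fare_amount`, and the `bronzerawdata.green201501` reference, are assumptions here; check the actual schema in your lakehouse and adjust as needed.

```python
# A rough sketch of the aggregation Copilot might suggest -- column names are
# assumptions; adjust them to the actual schema of green201501.
from pyspark.sql import functions as F

df = spark.sql("SELECT * FROM bronzerawdata.green201501")

display(
    df.groupBy("payment_type")
      .agg(
          F.avg("trip_distance").alias("avg_trip_distance"),
          F.avg("fare_amount").alias("avg_fare_amount"),
      )
)
```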
Select the Copilot icon found in the notebooks ribbon. This action opens the Copilot chat panel and creates a new cell at the top of your notebook. Note: This cell is essential for initializing a Spark session within a Fabric notebook and must be executed for Copilot to function correctly. Future releases may introduce alternative initialization methods.
When the Copilot panel opens, click `Get Started` to initiate your interaction with the AI assistant.
Copilot will automatically insert a new cell containing the necessary library installation script. Execute this cell by clicking the `Play` button to install the required libraries for Copilot functionalities.
Post-installation, you will be presented with a note on data privacy and security. Please read through this to understand how your data is stored and processed. Additionally, guidelines will be provided on how to effectively communicate with Copilot.
Now, engage with Copilot by exploring various prompts related to your data. Feel free to ask for code snippets, clarification, or paste the generated code into a new notebook cell. This is an opportunity to explore the capabilities and assistance Copilot offers for data science and engineering tasks.
[!TIP] Please feel free to ask Copilot any questions to deepen your understanding of the dataset and enhance our workshop experience. Just an example: `Analyze my lakehouse table named green201501 and provide insights about the data`.
This quick demonstration aims to highlight the ease of accessing Copilot for insightful data analysis.
[!IMPORTANT]
Copilot is here to support and guide, to "co-pilot", not to control and "pilot". We'll move forward with our structured activities and this documentation, letting them serve as our copilot rather than relying on Copilot directly.
This task focuses on various methods for extracting data from the lakehouse into your notebook for analysis. Below are step-by-step instructions to perform this in your notebook.
Remember, to execute code within a cell, use CTRL + Enter on Windows or ⌘ + Enter on macOS. Alternatively, click the Run icon that appears to the left of the cell.
Enter the following PySpark code in a new cell in your Fabric notebook. This script will retrieve data from a specified lakehouse table. Make sure to replace `bronzerawdata` and `green202301` with your lakehouse and table names if they differ.
Tip
This is a repetition of Exercise 1.3.11, but with additional explanation. Skip it if you do not need the code explanation.
df = spark.sql("SELECT * FROM bronzerawdata.green202301 LIMIT 1000")
display(df)
Explanation of the code:
- `df = spark.sql("SELECT * FROM bronzerawdata.green202301 LIMIT 1000")` - this line uses the `spark.sql()` function to run an SQL query on a table called `green202301` located in the lakehouse `bronzerawdata`. The query selects all columns (`*`) from the table and limits the result to the first 1000 rows with the `LIMIT 1000` clause. The result of the query is then stored in a PySpark DataFrame called `df`.
- `display(df)` - the `display()` function is used to visualize the contents of a DataFrame in a tabular format. In this case, it visualizes the contents of the `df` DataFrame created in the previous line.
Fabric Notebooks support various programming languages, including PySpark, Scala, SQL, and R. To switch to SQL, for example, use the %%sql magic command at the beginning of a notebook cell.
%%sql
SELECT * FROM bronzerawdata.green202301 LIMIT 1000
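Another way to get lakehouse data into a DataFrame, besides `spark.sql()` and the `%%sql` magic, is the DataFrame reader API. A minimal sketch, using the same table as above (adjust the names if yours differ):

```python
# Read the lakehouse table directly with the DataFrame reader API,
# then keep only the first 1000 rows, mirroring the LIMIT 1000 above.
df = spark.read.table("bronzerawdata.green202301").limit(1000)
display(df)
```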
Now, let's execute a specific data selection command. This command filters specific columns from the DataFrame and displays the first five rows:
%%pyspark
df.select("VendorID", "trip_distance", "fare_amount", "tip_amount").show(5)
The code `df.select("VendorID", "trip_distance", "fare_amount", "tip_amount").show(5)` displays the first five rows of the DataFrame `df`, restricted to the columns `VendorID`, `trip_distance`, `fare_amount`, and `tip_amount`. This is useful when working with large datasets to quickly inspect the data and ensure that it has been loaded correctly.
When working with large datasets, starting with data retrieval sets the foundation for subsequent data analysis tasks, which may include filtering, sorting, and aggregating data. As you delve deeper, you may encounter more complex data engineering tasks such as cleansing, transformation, and aggregation, essential for advanced data analysis and insights extraction.
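If you want to experiment beyond the workshop steps, here is a minimal sketch of filtering, sorting, and aggregating the `df` loaded earlier; the column names come from the select example above, and the thresholds are arbitrary.

```python
from pyspark.sql import functions as F

# Filter: keep trips longer than 2 miles, then sort by fare, highest first.
long_trips = (
    df.filter(F.col("trip_distance") > 2)
      .orderBy(F.col("fare_amount").desc())
)
display(long_trips.limit(10))

# Aggregate: average fare and tip per vendor.
display(
    df.groupBy("VendorID")
      .agg(
          F.avg("fare_amount").alias("avg_fare"),
          F.avg("tip_amount").alias("avg_tip"),
      )
)
```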
We aim to expand the bronze layer by loading additional data. Below is a table that provides an updated view of the types of data we are loading into the bronze layer and the methods we are using for this purpose.
This set of instructions will guide you through the process of downloading external data and integrating it into your Lakehouse for comprehensive analysis.
Open the provided URL in a new tab and download a CSV file containing information on discounts applied to users on a specific date. This data is vital for comprehensive analysis and has been generated for your convenience.
Tip
Download the file to your local machine from this link Download Discount Data.
To integrate this discount data with existing datasets:
- Go to the `Files` section in your Lakehouse.
- Click on the three dots to access additional options and select the `Upload` button.
- Choose `Upload Files` from the menu.

Select the recently downloaded file, likely named `NYC-Taxi-Discounts-Per-Day.csv`, then initiate the upload by clicking the `Upload` button.
The file should upload within a few seconds. This method provides a straightforward approach to augmenting your Lakehouse data.
Refresh the Lakehouse's `Files` section to view the newly uploaded file. Employ the drag-and-drop feature to move this file into your notebook. This action will generate a cell prepopulated with code, which you can execute to review the new data.
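The prepopulated cell usually looks similar to the sketch below, but treat it as an illustration: the exact generated code and relative path may differ in your environment. The commented-out last line is an optional, code-based counterpart to the Load to Delta (Load to Tables) UI option, not a required step, and the target table name is only an example.

```python
# Illustrative sketch of a drag-and-drop generated cell -- the actual generated
# code and relative path may differ slightly in your environment.
df = spark.read.format("csv").option("header", "true").load("Files/NYC-Taxi-Discounts-Per-Day.csv")
display(df)

# Optional, code-based alternative to the "Load to Tables" (Load to Delta) UI option;
# the target table name below is just an example.
# df.write.format("delta").mode("overwrite").saveAsTable("nyc_taxi_discounts_per_day")
```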
Assign an appropriate name to your notebook reflecting its purpose, such as `Data Exploration` or `Discount Analysis`, to maintain clarity and organization within your projects.
Finally, transition to the `Data Engineering` tab, adhering to the instructions depicted on-screen, to continue your data analysis journey with the newly integrated datasets.
Note
You can import one or more existing notebooks from your local computer to a Fabric workspace from the Data Engineering or the Data Science homepage. Fabric notebooks recognize the standard Jupyter Notebook .ipynb files, and source files like .py, .scala, and .sql, and create new notebook items accordingly.
If you have not downloaded the repository (Step 16. Download the exercise files), you can download just a separate notebook. This screenshot presents the steps to do that.
Ensure you are in the `Data Engineering` context of your Fabric workspace. Then, navigate to your workspace and select `New`, where you'll find an option to upload notebooks, symbolized by a notebook icon.
Click this icon to open the upload sidebar, similar to how you previously uploaded a file. From here, choose the notebook you've recently downloaded, named notebook-2.ipynb, and initiate the upload.
Once you start the upload, you'll receive a notification indicating that the import of the file is underway. Wait for this process to complete; it typically takes only a few moments.
After the import completes, locate the newly imported notebook in the `urban-innovation-de{NNN}` workspace, where NNN represents the number assigned to you. Click on the three dots associated with the notebook and select `Open Notebook`. For convenience, you can open the notebook in the background, which will keep its icon continuously accessible from the vertical sidebar on the left.
Congratulations, you've successfully completed the task and enhanced your data engineering capabilities with a pre-made notebook!
This step-by-step guide will help you to integrate your Lakehouse with the pre-made notebook for effective data manipulation and analysis.
In your opened notebook, locate the section referring to Lakehouses, typically shown in the screenshot provided within the notebook. Click on this section to view your Lakehouse options.
Within the Lakehouse options, click on the `Add` button to initiate the process of linking a Lakehouse to your notebook.
Choose the option to select an `Existing Lakehouse` from the available choices. After making this selection, click the `Add` button to proceed.
From the list of available Lakehouses, identify and select your own, named `bronzerawdata`. Be careful to choose the correct one to ensure accurate data analysis. Once confirmed, click `Add` to attach it to your notebook.
Verify that your Lakehouse, `bronzerawdata`, is now correctly linked and visible within your notebook settings. This confirmation ensures that you are all set for executing data-related tasks within the notebook.
The last task before fully immersing ourselves in data engineering work within the notebook is to create and attach a new Silver Lakehouse. Please follow these steps:
2.6.1. Click on the pin icon next to the Default Lakehouse, `bronzerawdata`. Then select `Add Lakehouse`.
2.6.3. Follow the naming convention and enter a name for the Lakehouse. The suggested name is `silvercleansed`.
2.6.4. Confirm that your notebook is now linked to two Lakehouses: the default one (bronze) and the newly added one (silver). Once this is verified, we can begin our data engineering work.
In this task, follow the notebook and the code provided, as well as all the instructions written in the code. Execute all code cells and complete all the steps there.
Important
Fabric Spark enforces a cores-based throttling and queueing mechanism, where users can submit jobs based on the purchased Fabric capacity SKUs. The queueing mechanism is a simple FIFO-based queue, which checks for available job slots and automatically retries the jobs once the capacity has become available. When users submit notebook or lakehouse jobs like Load to Table when their capacity is at its maximum utilization due to concurrent running jobs using all the Spark Vcores available for their purchased Fabric capacity SKU, they're throttled with the message HTTP Response code 430: Unable to submit this request because all the available capacity is currently being used. The suggested solutions are to cancel a currently running job, increase the available capacity, or try again later.
You will have completed the 2.7 notebook task once you reach the last code cell, which will lead you back here to Task 2.8.
Tip
If you have completed Notebook 2.7 and are seeking advanced challenges, please review the notebook and list all the improvements you would suggest. I have intentionally left a couple of areas with room for improvement. Once you identify those areas, feel free to discuss them with the instructors.
Please repeat the exercise from 1.4 Management of Spark Sessions, and check if you have any additional ongoing and active Spark sessions. If so, please cancel them.
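If you prefer code over the monitoring view for the notebook you are currently working in, Microsoft Spark Utilities can stop its interactive session directly. Treat the call below as a convenience sketch; the steps from 1.4 remain the reliable way to review and cancel all of your sessions.

```python
# Stops the interactive Spark session attached to this notebook; the next run
# will start a new session. Use it only when you are done working here.
mssparkutils.session.stop()
```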
Congratulations, you have completed the advanced data engineering notebook. Now, let's focus on automation. Considering we have just two tables, imagine the scenario where you need to process 50 tables or 50 different Parquet data sources. In such cases, the most efficient approach is to build and prioritize a data pipeline. This is the aim of the task at hand – to establish automation.
Tip
Refresh your lakehouse (use the three dots "..." and then click the refresh button) to double-check if the tables are in the lakehouse.
Ensure the following before starting the automation process:
- Lakehouse `bronzerawdata` contains two tables: `green201501` and `green202301`.
- Lakehouse `bronzerawdata` includes one folder in the Files section, named `2023`, created by a shortcut.
- Lakehouse `bronzerawdata` has one file in the Files section: `NYC-Taxi-Discounts-Per-Day.csv`.
- Lakehouse `silvercleansed` consists of three tables: `green2015_avg_fare_per_month`, `green201501_cleansed`, and `green201501_discounts`.
Once all is set, proceed with the automation.
Navigate to `Home` as depicted in the instructions.
Click on `Data Pipeline` and name the new pipeline `Bronze2Silver`.
Select the `ForEach` activity as shown on the screen.
Provide a name for each `ForEach` element.
Firstly, click on the background pane (first step from the screenshot in the pink rectangle) to see the tab with parameters and variables.
In the pipeline settings tab, navigate to `Variables`. Here, create a new variable named `table_name`, set its type to `Array`, and assign the default value `["green201501", "green202301"]`. Follow the specific step presented in the screenshot.
In the `ForEach` settings, select `Sequential`. To add dynamic content, open the sidebar and select the `table_name` variable. Confirm by clicking `OK`. Follow the specific step presented in the screenshot.
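For reference, once you pick the `table_name` variable in the pipeline expression builder, the ForEach `Items` field ends up holding the expression `@variables('table_name')`; you can also type it in directly if you prefer.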
Note
Sequential specifies whether the loop should be executed sequentially or in parallel. Maximum of 50 loop iterations can be executed at once in parallel. For example, if you have a ForEach activity iterating over a copy activity with 10 different source and sink datasets with isSequential set to False, all copies are executed at once.
This "Sequential" mode may not be the best strategy for this task, especially since we want to run the same notebook for two different tables in parallel. If you arrived at the same conclusion, that's a very astute observation; feel free to discuss it with the instructors.
Under `Activities`, choose `Notebook`.
From the General tab (presented in the screenshot), go to the `Settings` tab for the notebook, as illustrated.
Opt for the `notebook-2` that you uploaded previously.
Add a new parameter named `table_name`, with type `String` and value `@item()`. `@item()` comes again from dynamic content (the pipeline expression builder).
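If you are curious how this value is consumed on the notebook side: Fabric passes base parameters into a cell marked as a parameter cell, overriding its defaults at run time. The sketch below is hypothetical and not necessarily how notebook-2 is actually written.

```python
# Hypothetical parameter cell (mark it via "Toggle parameter cell" in the notebook);
# the pipeline's table_name base parameter overrides this default at run time.
table_name = "green201501"

# Later cells can then use the parameter, for example:
df = spark.sql(f"SELECT * FROM bronzerawdata.{table_name} LIMIT 1000")
display(df)
```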
After configuring, click `Validate` to ensure there are no errors. Click on `Run`.
Save the settings and initiate the run by clicking the `Run` button.
Note that the execution of the two notebooks occurs sequentially, typically taking two minutes each. However, as these notebooks do not depend on each other, consider modifying the pipeline to run the notebooks in parallel for efficiency.
Tip
(1) Instead of iterating through notebooks with a ForEach loop, you may consider structuring your approach to input various values into a single notebook execution. This can be complemented by coding loops within the notebook itself (a small sketch of this idea follows this tip).
(2) Utilize the `mssparkutils.notebook.runMultiple()` method to execute multiple notebooks in parallel, enhancing efficiency and saving time. This method is particularly useful when you do not need to wait for each notebook to complete before starting another. For an introduction and detailed usage, execute `mssparkutils.notebook.help("runMultiple")`. Here's an illustrative example:
mssparkutils.notebook.runMultiple(["NotebookA", "NotebookB"])
Explore more about this feature here.
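To make idea (1) concrete, here is a minimal, hypothetical sketch: one notebook run loops over several tables itself, instead of the pipeline looping over notebook runs. The table list and the processing placeholder are assumptions; the real per-table logic would be whatever notebook-2 does.

```python
# Sketch of idea (1): loop over several tables inside a single notebook run.
# The processing step is a placeholder -- in practice it would be the same
# cleansing / transformation logic as in notebook-2.
table_names = ["green201501", "green202301"]

for table_name in table_names:
    df = spark.sql(f"SELECT * FROM bronzerawdata.{table_name}")
    # ... apply the cleansing / transformation logic here ...
    print(table_name, df.count())
```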
Congratulations on completing this significant milestone in data engineering automation! Your skills in automating data processes have now been markedly enhanced.
Upon completing Exercises 1 and 2, it's crucial to verify the following outcomes in your Lakehouse environments:
In the `bronzerawdata` Lakehouse:

- Tables: Ensure there are two tables present: `green201501` and `green202301`.
- Files Section: Confirm there is one folder named `2023`, created via the shortcut.
- File Existence: Verify there is one file in the Files section: `NYC-Taxi-Discounts-Per-Day.csv`.

In the `silvercleansed` Lakehouse:

- Tables: Check that there are six tables:
  - `green201501_avg_fare_per_month_2015_01`
  - `green201501_cleansed`
  - `green201501_discounts`
  - `green202301_avg_fare_per_month_2023_01`
  - `green202301_cleansed`
  - `green202301_discounts`
Review the screenshots provided to compare and confirm the setup in your Lakehouses matches the expected structure.
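If you prefer to verify from a notebook rather than screenshots, here is a quick sketch using the Spark catalog; it assumes both lakehouses are attached to (or visible from) the notebook you are working in.

```python
# List the tables registered in each lakehouse used in this exercise.
for lakehouse in ["bronzerawdata", "silvercleansed"]:
    print(f"--- {lakehouse} ---")
    for t in spark.catalog.listTables(lakehouse):
        print(t.name)
```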
The final task before delving into data science work (Exercise 4) within the notebook is to create a new Gold Lakehouse named `goldcurated`.
Please follow these steps:
2.10.1. From the view of all artifacts you have created inside your workspace, click on the `New` button.
2.10.3. Adhere to the naming convention and enter a name for the Lakehouse. The suggested name is `goldcurated`.
Important
Once completed, go to the next exercise (Exercise 3). If time permits before the next exercise begins, consider continuing with the extra steps.