diff --git a/ADA/ada.qmd b/ADA/ada.qmd
index d40d7bd..fbca07e 100644
--- a/ADA/ada.qmd
+++ b/ADA/ada.qmd
@@ -1,5 +1,5 @@
---
- title: "Analytical Data Access (ADA) and databricks"
+ title: "Analytical Data Access (ADA) and Databricks"
---
Guidance for analysts on how to interact with and use data stored in ADA using databricks
diff --git a/ADA/databricks_fundamentals.qmd b/ADA/databricks_fundamentals.qmd
index d43fa27..f32ba15 100644
--- a/ADA/databricks_fundamentals.qmd
+++ b/ADA/databricks_fundamentals.qmd
@@ -4,11 +4,9 @@ title: "Databricks fundamentals"
------------------------------------------------------------------------
-# What is Databricks?
+## What is Databricks?
-------------------------------------------------------------------------
-
-Databricks is a web based platform for large scale data manipulation and analysis using code to create reproducible data pipelines. Primarily it takes the form of a website which you can create data pipelines and perform analysis in. It currently supports the languages R, SQL, python and scala, and integrates well with Git based version control systems such as GitHub or Azure DevOps.
+Databricks is a web-based platform for large-scale data manipulation and analysis, using code to create reproducible data pipelines. Primarily it takes the form of a website in which you can create data pipelines and perform analysis. It currently supports the languages R, SQL, Python and Scala, and integrates well with Git-based version control systems such as GitHub or Azure DevOps.
Behind the scenes it is a distributed cloud computing platform which utilizes the [Apache Spark engine](https://spark.apache.org/) to split up heavy data processing into smaller chunks. It then distributes them to different 'computers' within the cloud to perform the processing of each chunk in parallel. Once each 'computer' is finished processing the results are recombined and passed back to the user or stored.
@@ -22,6 +20,8 @@ In addition, it also provides new tools within the platform to construct and aut
Underpinning the technology are some key differences in how computers we're familiar with, and Databricks (and distributed computing in general) are structured.
+---
+
### Traditional computing
------------------------------------------------------------------------
@@ -35,9 +35,11 @@ Currently, we are used to using a PC or laptop to do our data processing. A trad
![](/images/ada-traditional-computer.jpg){width="273"}
-Any traditional computer is limited by it's hardware meaning there is an upper limit on the size and complexity of data it can process.
+Any traditional computer is limited by its hardware, meaning there is an upper limit on the size and complexity of data it can process.
-In order to increase the amount of data a computer can process you would have to switch out the physical hardware of the machine for something more powerful.
+In order to increase the amount of data a computer can process, you would have to switch out the physical hardware of the machine for something more powerful.
+
+---
### On Databricks
@@ -55,11 +57,11 @@ The storage and computation are separated into different components rather than
------------------------------------------------------------------------
-- **Scalable** - if you need more computing power you can increase your computing power and only pay for what you use rather than having to build an expensive new machine
-- **Centralised** - All data, scripts, and processes are available in a single place and access for any other user can be controlled by their author, or the wider Department as required.
-- **Data Governance** - The Department is able to 'see' all of it's data and organisational knowledge. This enables it to ensure it is access controlled and align with GDPR and data protection legislation and guidelines.
-- **Auditing and version control** - The Platform itself generates a lot of metadata which enables it to keep versioned history of it's data, outputs, etc.
-- **Automation** - Complex data processing pipelines can be set up using Databricks workflows and set to automatically run, either on a timer or a specific trigger allowing for a fully automated production process.
+- **Scalable** - if you need more computing power, you can increase your computing power and only pay for what you use rather than having to build an expensive new machine
+- **Centralised** - All data, scripts, and processes are available in a single place and access for any other user can be controlled by their author, or the wider Department as required
+- **Data Governance** - The Department is able to 'see' all of its data and organisational knowledge. This enables it to ensure it is access controlled and aligns with GDPR and data protection legislation and guidelines
+- **Auditing and version control** - The Platform itself generates a lot of metadata which enables it to keep versioned history of its data, outputs, etc
+- **Automation** - Complex data processing pipelines can be set up using Databricks workflows and set to automatically run, either on a timer or a specific trigger allowing for a fully automated production process
Each of these aspects bring benefits to the wider Department and for analysts within it.
@@ -71,14 +73,16 @@ The auditing, and automation facilities provide a lot of benefits when building
------------------------------------------------------------------------
-# Key concepts
+## Key concepts
-------------------------------------------------------------------------
-
-## Storage
+
+### Storage
There are a few different ways of storing files and data on Databricks. Your data, and modelling areas will reside in the 'unity catalog', whereas your scripts and code will live on your 'workspace'.
+---
+
### Unity catalog
------------------------------------------------------------------------
The unity catalog can be accessed through the 'Catalog' option in the Databricks
![](/images/ada-unity-catalog-sidebar.png)
+---
+
#### Structure of the unity catalog
------------------------------------------------------------------------
A schema can contain any number of tables, views and volumes.
![](/images/ada-unity-catalog.jpg)
+---
+
#### Catalogs not databases
------------------------------------------------------------------------
The 'unity catalog' is a single catalog that contains all the other catalogs of data in the Department. Catalogs are very similar in concept to a SQL database in that they they contain schemas, tables of data and views of data.
+---
+
#### Schemas, tables and views
------------------------------------------------------------------------
Like a SQL database a catalog has schemas, tables, and views which store data in a structured (usually tabular) format.
-A schema is a sub-division of a catalog which allows for logical separation of data stored in the catalog. Whoever creates a schema is it's owner, and is able to set fine grained permissions on who can see / edit the data within it. Permissions can also be set for groups of analysts, and can be modified by the ADA team if the original owner is no longer available.
+A schema is a sub-division of a catalog which allows for logical separation of data stored in the catalog. Whoever creates a schema is its owner, and is able to set fine grained permissions on who can see / edit the data within it. Permissions can also be set for groups of analysts, and can be modified by the ADA team if the original owner is no longer available.
Tables are equivalent to SQL tables, and store data in a tabular format. Tables in Databricks have the ability to turn on version control which audits each change to the data and allows a user to go back in time to see earlier versions of the table.
@@ -121,6 +131,8 @@ Views look and act the same as tables, however instead of storing the data as it
Tables and views sit within a schema and these are where you would store your core datasets and pick up data to analyse from.
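+As an illustration, once you have been granted access to a schema, a table can be read from a Databricks notebook using its three-level name (catalog.schema.table). A minimal sketch in R using sparklyr is shown below; the catalog, schema and table names are placeholders rather than real ADA objects.
+
+```r
+library(sparklyr)
+library(dplyr)
+
+# Connect to the cluster's Spark session from inside a Databricks notebook
+sc <- spark_connect(method = "databricks")
+
+# Reference a table by its three-level name: catalog.schema.table
+# (catalog_name, schema_name and pupil_absence are placeholders)
+absence <- tbl(sc, dbplyr::in_catalog("catalog_name", "schema_name", "pupil_absence"))
+
+# dplyr verbs are translated to Spark SQL and run on the cluster
+absence %>%
+  head(10) %>%
+  collect()
+```
+
+The same table could also be queried from a SQL cell with `SELECT * FROM catalog_name.schema_name.pupil_absence`.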
+---
+
#### Volumes
------------------------------------------------------------------------
Volumes are stored under a schema within a catalog. Files in here can be accesse
Examples of files suitable to be stored in a volume include CSVs, JSON and other formats of data files, or supporting files / images for applications you develop through the platform. You can also upload files to a volume through the user interface.
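+As a rough sketch of what this looks like in practice, a CSV held in a volume can be read from R running on a Databricks cluster via its /Volumes path. All of the names below are placeholders.
+
+```r
+library(readr)
+
+# Files in a volume sit under /Volumes/<catalog>/<schema>/<volume>/
+# (catalog_name, schema_name, my_volume and the file name are placeholders)
+csv_path <- "/Volumes/catalog_name/schema_name/my_volume/absence_2024.csv"
+
+absence_2024 <- read_csv(csv_path)
+```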
+---
+
### Workspaces - Databricks file system (DBFS)
------------------------------------------------------------------------
@@ -151,6 +165,8 @@ Sharing code this way can be useful but has it's risks. If you allow other users
For collaboration on code you should use a GitHub/DevOps repository which each user can clone and work on independently.
:::
+---
+
### Repositories for version control
------------------------------------------------------------------------
To connect Databricks to a repository refer to the [Databricks and version contr
## Compute
-------------------------------------------------------------------------
-
In order to access data and run code you need to set up a compute resource. A compute resource provides processing power and memory to pick up and manipulate the data and files stored in the 'unity catalog'. The compute page can be accessed through the 'Compute' option in the Databricks sidebar.
![](/images/ada-compute.png)
All compute options can be used both within the Databricks platform and be conne
- [Setup Databricks SQL Warehouse with RStudio](databricks_rstudio_sql_warehouse.qmd)
- [Setup Databricks Personal Compute cluster with RStudio](databricks_rstudio_personal_cluster.qmd)
+---
+
### Creating a personal compute resource
------------------------------------------------------------------------
-1. To create your own personal compute resource click the 'Create with DfE Personal Compute' button on the compute page.
+1. To create your own personal compute resource, click the 'Create with DfE Personal Compute' button on the compute page\
![](/images/ada-compute-personal.png)
-2. You'll then be presented with a screen to configure the cluster. There are 2 options here under the performance section which you will want to pay attention to; Databricks runtime version, and Node type.\
+2. You'll then be presented with a screen to configure the cluster. There are 2 options here under the performance section which you will want to pay attention to: Databricks runtime version, and Node type\
\
- **Databricks runtime version** - This is the version of the Databricks software that will be present on your compute resource. Generally it is recommended you go with the latest LTS (long term support) version.
At the time of writing this is '15.4 LTS'.\
+ **Databricks runtime version** - This is the version of the Databricks software that will be present on your compute resource. Generally it is recommended you go with the latest LTS (long term support) version. At the time of writing this is '15.4 LTS'\
\
**Node type** - This option determines how powerful your cluster is and there are 2 options available by default:\
@@ -207,18 +223,18 @@ All compute options can be used both within the Databricks platform and be conne
\
![](/images/ada-compute-personal-create.png)
-3. Click the 'Create compute' button at the bottom of the page. This will create your personal cluster and begin starting it up. This usually takes around 5 minutes.\
+3. Click the 'Create compute' button at the bottom of the page. This will create your personal cluster and begin starting it up. This usually takes around 5 minutes\
\
![](/images/ada-compute-personal-create-button.png)
-4. Once the cluster is up and running the icon under the 'State' header on the 'Compute' page will appear as a green tick.\
+4. Once the cluster is up and running, the icon under the 'State' header on the 'Compute' page will appear as a green tick\
\
![](/images/ada-compute-ready.png)
::: callout-note
## Clusters will shut down after being idle for an hour
-Use of compute resources are charged by the hour, and so personal cluster have been set to shut down after being unused for an hour in order to prevent unnecessary cost to the Department.
+Use of compute resources is charged by the hour, and so personal clusters have been set to shut down after being unused for an hour in order to prevent unnecessary cost to the Department.
:::
::: callout-important
diff --git a/ADA/databricks_notebooks.qmd b/ADA/databricks_notebooks.qmd
index f1da7bf..a777a1a 100644
--- a/ADA/databricks_notebooks.qmd
+++ b/ADA/databricks_notebooks.qmd
@@ -1,25 +1,22 @@
---
-title: "Databricks Notebooks"
+title: "Databricks notebooks"
---
------------------------------------------------------------------------
## Notebooks
-------------------------------------------------------------------------
-
Notebooks are a special kind of script that Databricks supports. They consist of code blocks and markdown blocks which can contain formatted text, links and images. Due to the ability to combine markdown with code they are very well suited to creating and documenting data pipelines, such as creating a core dataset that underpins your other products. They are particularly powerful when parameterised and used in conjuction with [Workflows](ADA/Databricks_workflows.qmd).
-You can create a notebook in your workspace, either in a folder or a repository.\
-To do this locate the folder/repository you want to create the notebook in then click the 'Create' button and select Notebook.
+You can create a notebook in your workspace, either in a folder or a repository. To do this, locate the folder / repository you want to create the notebook in, then click the 'Create' button and select Notebook.
::: callout-tip
Any notebooks used for core business processes are created in a repository linked to GitHub/DevOps where they can be version controlled.
:::
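+When a notebook is synced to a repository in this way it is stored as a plain source file, with comment markers showing where each markdown or code cell begins. A rough, illustrative sketch of a small R notebook in that format (the headings and code are made up for the example):
+
+```r
+# Databricks notebook source
+# MAGIC %md
+# MAGIC ## Build the core dataset
+# MAGIC This markdown cell documents what the code cell below does.
+
+# COMMAND ----------
+
+# An ordinary R code cell
+total_rows <- nrow(mtcars)
+print(total_rows)
+```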
-Once you've created a notebook it will automatically be opened. Any changes you made are saved in real time so the notebook will always keep the latest version of it's contents. In order to 'save' a snapshot of your work it is recommended to use git commits.
+Once you've created a notebook it will automatically be opened. Any changes you make are saved in real time so the notebook will always keep the latest version of its contents. In order to 'save' a snapshot of your work it is recommended to use Git commits.
-You can change the title from 'Untitled Notebook *\
diff --git a/ADA/databricks_rstudio_sql_warehouse.qmd b/ADA/databricks_rstudio_sql_warehouse.qmd
index 9ffc6e2..7ea5061 100644
--- a/ADA/databricks_rstudio_sql_warehouse.qmd
+++ b/ADA/databricks_rstudio_sql_warehouse.qmd
@@ -1,5 +1,5 @@
---
- title: "Setup Databricks SQL Warehouse with RStudio"
+ title: "Set up Databricks SQL Warehouse with RStudio"
---
The following instructions set up an ODBC connection between your laptop and your DataBricks SQL warehouse, which can then be used in R/RStudio to query data using an ODBC based package.
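+Once that connection is in place, querying data held in ADA from R looks something like the sketch below. This assumes the connection has been saved as an ODBC DSN (named 'Databricks' here purely as an example) and that the DBI and odbc packages are installed; the catalog, schema and table names are placeholders.
+
+```r
+library(DBI)
+library(odbc)
+
+# Connect via the ODBC DSN created by following this page ('Databricks' is a placeholder name)
+con <- dbConnect(odbc::odbc(), dsn = "Databricks")
+
+# Query a table in the unity catalog using its three-level name (placeholder names)
+absence <- dbGetQuery(
+  con,
+  "SELECT * FROM catalog_name.schema_name.pupil_absence LIMIT 10"
+)
+
+dbDisconnect(con)
+```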
diff --git a/ADA/databricks_workflow_script_databricks.qmd b/ADA/databricks_workflow_script_databricks.qmd
index f8ba8f1..c6a1d3d 100644
--- a/ADA/databricks_workflow_script_databricks.qmd
+++ b/ADA/databricks_workflow_script_databricks.qmd
@@ -1,5 +1,5 @@
---
-title: "Scripting Workflows in Databricks"
+title: "Script workflows in Databricks"
---
Workflows can be constructed through the Databricks Workflows user interface (UI), however for large or complex workflows the UI can be a time consuming way to build a workflow. In these scenarios it is quicker and more inline with RAP principles to script your workflow.
diff --git a/ADA/databricks_workflow_script_rstudio.qmd b/ADA/databricks_workflow_script_rstudio.qmd
index fdbf260..298ad56 100644
--- a/ADA/databricks_workflow_script_rstudio.qmd
+++ b/ADA/databricks_workflow_script_rstudio.qmd
@@ -1,5 +1,5 @@
---
-title: "Scripting Workflows in RStudio"
+title: "Script workflows in RStudio"
---
Workflows can be constructed through the Databricks Workflows user interface (UI), however for large or complex workflows the UI can be a time consuming way to build a workflow. In these scenarios it is quicker and more inline with RAP principles to script your workflow.
diff --git a/ADA/databricks_workflows.qmd b/ADA/databricks_workflows.qmd
index 14a9c38..52f2850 100644
--- a/ADA/databricks_workflows.qmd
+++ b/ADA/databricks_workflows.qmd
@@ -1,26 +1,25 @@
---
-title: "Databricks Workflows"
+title: "Databricks workflows"
---
------------------------------------------------------------------------
## Workflows user interface
-------------------------------------------------------------------------
-Workflows allow you to build complex data pipelines by chaining together multiple scripts, queries, notebooks and logic. They can be used to build Reproducible analytical pipelines (RAP) that can be re-run with different parameters and have all inputs and outputs audited automatically. Other recommended uses of workflows are any data modelling tasks such as cleaning your source data and collating it into a more analytically friendly format in your modelling area.
+Workflows allow you to build complex data pipelines by chaining together multiple scripts, queries, notebooks and logic. They can be used to build Reproducible Analytical Pipelines (RAP) that can be re-run with different parameters and have all inputs and outputs audited automatically. Other recommended uses of workflows are any data modelling tasks such as cleaning your source data and collating it into a more analytically friendly format in your modelling area.
Each step in a workflow is referred to as a task and each task has dependencies. They are accessible through the 'Workflows' link on the left hand menu of the DataBricks UI.
![](/images/ada-workflow-menu.png)
-Parameters can be set either at a workflow level or a task level and referred to in your scripts / notebooks, allowing you to reuse tasks/workflows for similar operations.
+Parameters can be set either at a workflow level or a task level and referred to in your scripts / notebooks, allowing you to reuse tasks / workflows for similar operations.
Each task can have dependencies on other tasks and can be set to only run under certain conditions, for example all of the previous tasks have completed successfully.
These can be configured when tasks are added to the workflow through the user interface.
-Workflows and tasks can also be configured to send notifications to users upon success/failure. These can be configured from the Workflow and Task user interfaces.
+Workflows and tasks can also be configured to send notifications to users upon success or failure. These can be configured from the Workflow and Task user interfaces.
-Workflows also come with robust support for git/DevOps repositories and can be set to run from a specific repo, branch, commit or tag.
+Workflows also come with robust support for GitHub and Azure DevOps repositories and can be set to run from a specific repo, branch, commit or tag.
------------------------------------------------------------------------
@@ -50,7 +49,7 @@ Workflows also come with robust support for git/DevOps repositories and can be s
::: callout-note
## Source
-The source by default is set to your Workspace, but it is recommended that you use a version controlled Git repository instead. This prevents you changing the code of a notebook during a workflow run as the tasks are sourced from a specific repository version, branch, commit or tag rather than a workbook you may be working on.
+The source by default is set to your workspace, but it is recommended that you use a version controlled Git repository instead. This prevents you from changing the code of a notebook during a workflow run as the tasks are sourced from a specific repository version, branch, commit or tag rather than a notebook you may be working on.
:::
::: callout-note
@@ -66,12 +65,13 @@ Once development is complete and the workflow becomes business as usual it may b
\
**Depends on** - a list of tasks that must be run before the start of this task\
**Run if dependencies** - Instructions for the conditions to run the task based on the dependencies set above. The default option is 'All succeeded' but there are also the following options:\
- \
- At least one succeeded\
- None failed\
- All done\
- At least one failed\
- - All failed\
+ - All failed
+ \
+ \
7. After setting up a flow of tasks you will be presented with a graphical presentation of the workflow as seen below:
![](/images/ada-workflow-task-chart.png)
@@ -82,9 +82,13 @@ Once development is complete and the workflow becomes business as usual it may b
------------------------------------------------------------------------
-Each time a workflow is ran DataBricks audits any input parameters, all outputs, the success and failure of each task, along with when it was run and who by.
+Each time a workflow is run, Databricks audits:\
+- any input parameters\
+- all outputs\
+- the success or failure of each task\
+- when it was run and who by\
-This makes them a very powerful debugging tool as you can refer back to results from previous runs. This means that if your pipeline fails you can review the notebook(s) that failed and troubleshoot the issue.
+This makes workflows a very powerful debugging tool as you can refer back to results from previous runs. This means that if your pipeline fails you can review the notebook(s) that failed and troubleshoot the issue.
Workflows that fail also allow you to repair the workflow once you have found and fixed the issue. This prevents having to re-run the whole pipeline from scratch and allows it to pick up from the point where it failed.
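+The input parameters audited for each run are whatever values the workflow or task passes to its notebooks. Inside a notebook task written in R they can be read with the widget utilities that Databricks exposes in notebooks; a minimal sketch is below, where 'run_year' is a placeholder parameter name and the default value is only used when the notebook is run interactively.
+
+```r
+# Inside a Databricks R notebook used as a workflow task
+# Define a widget with a default; a task parameter of the same name overrides it
+dbutils.widgets.text("run_year", "2024")
+
+run_year <- dbutils.widgets.get("run_year")
+
+print(paste("Building outputs for", run_year))
+```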
@@ -94,7 +98,6 @@ Workflows that fail also allow you to repair the workflow once you have found an
## Coded workflows
-------------------------------------------------------------------------
Another useful aspect of workflows is that they can be defined and ran using code through the [DataBricks Jobs API](https://docs.databricks.com/api/workspace/jobs).
diff --git a/ADA/git_databricks.qmd b/ADA/git_databricks.qmd
index aaa67d2..1811897 100644
--- a/ADA/git_databricks.qmd
+++ b/ADA/git_databricks.qmd
@@ -1,5 +1,5 @@
---
- title: "Databricks and version control"
+ title: "Use Databricks with Git"
---
diff --git a/_quarto.yml b/_quarto.yml
index 4aa26fe..8283967 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -108,17 +128,28 @@ website:
- RAP/rap-faq.qmd
- RAP/rap-statistics.qmd
- section: "ADA and Databricks"
- contents:
+ contents:
+ - text: "---"
+ - text: "Understanding Databricks"
+ - text: "---"
- ADA/ada.qmd
+ - ADA/databricks_fundamentals.qmd
+ - ADA/databricks_notebooks.qmd
+ - ADA/databricks_workflows.qmd
+ - text: "---"
+ - text: "How to..."
+ - text: "---"
- ADA/databricks_rstudio_sql_warehouse.qmd
- ADA/databricks_rstudio_personal_cluster.qmd
- ADA/git_databricks.qmd
+ - ADA/databricks_workflow_script_databricks.qmd
+ - ADA/databricks_workflow_script_rstudio.qmd
format:
html:
theme:
- light: cyborg
- dark: united
+ light: [cyborg, theme-dark.scss]
+ dark: [united, theme-light.scss]
code-copy: true
highlight-style: printing
code-overflow: wrap
@@ -128,3 +139,4 @@ filters:
- include-files.lua
- newpagelink.lua
- quarto
+
diff --git a/theme-dark.scss b/theme-dark.scss
new file mode 100644
index 0000000..36e6734
--- /dev/null
+++ b/theme-dark.scss
@@ -0,0 +1,3 @@
+/*-- scss:defaults --*/
+
+$sidebar-fg: #2a9fd6;
diff --git a/theme-light.scss b/theme-light.scss
new file mode 100644
index 0000000..7e44864
--- /dev/null
+++ b/theme-light.scss
@@ -0,0 +1,3 @@
+/*-- scss:defaults --*/
+
+$sidebar-fg: #9c3815;